cshizhe / hgr_v2t
Code accompanying the paper "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning".
License: MIT License
I found that some files are missing in the data downloaded from BaiduNetdisk. There are 6 files in MSRVTT/annotation/RET (int2word.npy, ref_captions.json, sent2rolegraph.augment.json, sent2srl.json and word2int.json), but some of them are missing from the other datasets. For example, there are only 2 files in MSVD/annotation/RET (ref_captions.json, sent2rolegraph.augment.json).
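A quick way to see which annotation files each dataset is missing is a small check like the sketch below. The expected file list is only an assumption taken from the MSRVTT listing above; other datasets may legitimately ship a subset.

```python
import os

# Files observed under MSRVTT/annotation/RET (assumed expected set, not an official spec).
EXPECTED = ['int2word.npy', 'word2int.json', 'sent2srl.json',
            'ref_captions.json', 'sent2rolegraph.augment.json']

for dataset in ['MSRVTT', 'MSVD', 'VATEX']:
    ret_dir = os.path.join(dataset, 'annotation', 'RET')
    present = set(os.listdir(ret_dir)) if os.path.isdir(ret_dir) else set()
    missing = [f for f in EXPECTED if f not in present]
    print(dataset, 'missing:', missing or 'none')
```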
Hi~ thanks for your nice work~
I want to caption a self-captured video. Could you please give some detailed instructions on how to adapt the pretrained model provided in the code to this task? For example, the feature extraction method, the feature data format, and how to visualize the final result? Thanks a lot!
OSError: Unable to open file (unable to open file: name = 'data/VATEX/ordered_feature/SA/resnet152.pth/trn_ft.hdf5'
Could you tell me how to use the I3D features instead?
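For reference, the ordered visual features are stored as HDF5 files, so a quick sanity check is simply opening the file and listing its contents. A minimal sketch, reusing the path from the error message above (adjust it to your local layout; the assumption that keys are video names may not hold exactly):

```python
import h5py

# Path copied from the error message above; change it to your own feature file.
path = 'data/VATEX/ordered_feature/SA/resnet152.pth/trn_ft.hdf5'

with h5py.File(path, 'r') as f:
    keys = list(f.keys())                       # e.g. one entry per video
    print(len(keys), 'entries, first key:', keys[0])
    print('feature shape of first entry:', f[keys[0]].shape)
```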
Hello, thanks for your great work! I'm very interested in visualizing the examples. How can I visualize the retrieved videos? Could you please upload the code?
Hi, I'm very interested in your work, and I want to use other datasets like Charades with your model. But several required files, like the annotations, don't exist for other datasets. What should I do to obtain these annotations, and how can I build the role graphs? Could you provide the tools mentioned in your paper? Thank you very much for any reply.
Hi Shizhe,
Thanks for your great work! I noticed that the training script needs to load a pretrained model:
--resume_file $resdir/../../word_embeds.glove42b.th
Is this used to initialize the text embedding module?
Besides, I cannot find this file in "MSRVTT/results/RET.released/" and can only find "MSRVTT/results/RET/word_embeds.glove32b.th". Is there any difference between word_embeds.glove42b.th and word_embeds.glove32b.th? Could you please share "word_embeds.glove42b.th"?
The pretrained model "bert-base-srl-2019.06.17.tar.gz" does not seem to be compatible with the latest version of allennlp.
If there is no verb in a sentence, how should it be handled?
Thanks for your great work!
I have a question: how long does it take to train your model on each of the three datasets?
Also, the BaiduNetdisk link is empty.
allennlp.common.checks.ConfigurationError: srl not in acceptable choices for dataset_reader.type
My config code for the predictor is:

    archive = load_archive('bert-base-srl-2019.06.17')
    predictor = Predictor.from_archive(archive, 'video-text classifier')
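For comparison, the usual way to load this SRL model with an older allennlp release (roughly the 0.9.x line, which still registers the 'srl' dataset reader) looks like the sketch below. It assumes that the version mismatch, together with the predictor name, is what triggers the ConfigurationError; newer allennlp releases moved SRL into the separate allennlp-models package.

```python
# Sketch for allennlp ~0.9.0; not guaranteed against other versions.
from allennlp.models.archival import load_archive
from allennlp.predictors.predictor import Predictor

archive = load_archive('bert-base-srl-2019.06.17.tar.gz')
# 'semantic-role-labeling' is the registered SRL predictor name,
# not 'video-text classifier'.
predictor = Predictor.from_archive(archive, 'semantic-role-labeling')
print(predictor.predict(sentence='a woman talks about a futuristic bicycle design'))
```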
Can you provide the datasets somewhere else, such as Google Drive or Dropbox? Downloading from Baidu requires an account, and I am not from China and do not have a Chinese phone number.
Thank you.
Recently I read your paper "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning". I saw that you used the Youtube2Text dataset in your paper. However, I could not find the video features and sentence features of the Youtube2Text dataset in the Baidu cloud link. Could you please provide a download link for the Youtube2Text data? Thank you very much!
I have a question about this data loading function: why is only one caption obtained per video?
def __getitem__(self, idx):
    out = {}
    if self.is_train:
        # training: each example is a (video, caption) pair
        video_idx, cap_idx = self.pair_idxs[idx]
        video_name = self.video_names[video_idx]
        mp_feature = self.mp_features[video_idx]
        sent = self.captions[cap_idx]
        cap_ids, cap_len = self.process_sent(sent, self.max_words_embedding)
        out['captions_ids'] = cap_ids
        out['captions_lens'] = cap_len
    else:
        # evaluation: only the video side is needed here
        video_name = self.video_names[idx]
        mp_feature = self.mp_features[idx]
    out['names'] = video_name
    out['mp_fts'] = mp_feature
    return out
Hi,
How are the (MP) features used for global matching extracted? Are they obtained by spatio-temporal average pooling of the features from a ResNet-152 pretrained on ImageNet?
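In case it helps, below is a minimal sketch of what such a global feature could look like under the assumption stated in the question: ImageNet-pretrained ResNet-152 frame features, mean-pooled over sampled frames. This mirrors the description above, not necessarily the authors' exact pipeline.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained ResNet-152 with the classification head removed.
resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def mp_feature(frames):
    """frames: list of PIL images sampled from one video -> 2048-d video feature."""
    batch = torch.stack([preprocess(f) for f in frames])  # (T, 3, 224, 224)
    frame_fts = resnet(batch)                             # (T, 2048) after global avg pool
    return frame_fts.mean(dim=0)                          # temporal mean pooling
```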
Hi, Shizhe, thanks for your great work. I downloaded the MSR-VTT dataset you provided, and I have a question. I found that not every video corresponds to 20 captions; some videos correspond to fewer than 20. I would like to ask whether you specifically selected these captions, and how they were chosen?
Thank you for your great code! After running it on my server several times, I was surprised to find that I cannot reproduce the results in the paper. The best recall sum I got on MSRVTT is 170.1, while the paper reports 172.4, and I did not modify anything in your code...
Could you please share the best parameters for your code, or suggest a solution to my problem?
Hi, cshizhe
I find that the number_of_features/video_duration ratio differs between videos. Can you tell me the temporal sampling interval of the visual features?
Thanks
Hello, could you provide the original videos of the MSR-VTT dataset?
Hi, Shizhe, thanks for the wonderful work!
For a new dataset, how can I get the word2int.json, int2word.npy and word.embedding.glove42.th files? I assume that you used a GloVe model to initialize the word embedding weights. Could you provide instructions for generating them?
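Not an official recipe, but one plausible way to build these files for a new dataset is to collect the training vocabulary, save the two mappings, and initialise an embedding matrix from GloVe vectors. The file names, the 300-d glove.42B format, and the random init for out-of-vocabulary words are assumptions based on the question above.

```python
import json
import numpy as np
import torch

def build_vocab_and_embeddings(captions, glove_txt_path, dim=300):
    # word2int / int2word built from the training captions (assumed layout).
    words = sorted({w for sent in captions for w in sent.lower().split()})
    word2int = {w: i for i, w in enumerate(words)}
    json.dump(word2int, open('word2int.json', 'w'))
    np.save('int2word.npy', np.array(words))

    # Embedding matrix initialised from GloVe (e.g. glove.42B.300d.txt);
    # words missing from GloVe keep a small random init.
    embeds = np.random.uniform(-0.1, 0.1, (len(words), dim)).astype(np.float32)
    with open(glove_txt_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if parts[0] in word2int:
                embeds[word2int[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
    torch.save(torch.from_numpy(embeds), 'word_embeds.glove42b.th')
```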
Hi, cshizhe, thanks for your great work.
When testing performance on the MSRVTT dataset, I found that the performance in different test runs is the same, but the sent_scores, verb_scores and noun_scores are different. I don't know why.
Here are some outputs from different test runs:
.......
tensor(-197.5491, device='cuda:0') tensor(4066.6943, device='cuda:0') tensor(4957.7461, device='cuda:0')
tensor(-172.1141, device='cuda:0') tensor(4193.5151, device='cuda:0') tensor(5157.7603, device='cuda:0')
tensor(-68.0737, device='cuda:0') tensor(1171.2622, device='cuda:0') tensor(1342.9297, device='cuda:0')
tensor(82.5919, device='cuda:0') tensor(4531.4185, device='cuda:0') tensor(5212.8369, device='cuda:0')
tensor(-43.9712, device='cuda:0') tensor(4319.0312, device='cuda:0') tensor(5150.5146, device='cuda:0')
tensor(1.5257, device='cuda:0') tensor(4386.4746, device='cuda:0') tensor(5333.5151, device='cuda:0')
tensor(-22.8292, device='cuda:0') tensor(1247.3308, device='cuda:0') tensor(1393.1257, device='cuda:0')
tensor(23.0804, device='cuda:0') tensor(1473.0065, device='cuda:0') tensor(1647.1292, device='cuda:0')
tensor(-31.6811, device='cuda:0') tensor(1406.5350, device='cuda:0') tensor(1616.0713, device='cuda:0')
tensor(-41.8293, device='cuda:0') tensor(1422.7487, device='cuda:0') tensor(1656.0972, device='cuda:0')
tensor(-10.5121, device='cuda:0') tensor(397.1695, device='cuda:0') tensor(444.0505, device='cuda:0')
ir1,ir5,ir10,imedr,imeanr,imAP,cr1,cr5,cr10,cmedr,cmeanr,cmAP,rsum
ir5-rsum,epoch.28.th,22.89,51.07,63.17,5.00,40.16,36.14,22.30,51.10,62.90,5.00,39.20,35.62,273.43
Output from a different run:
........
tensor(-89.9776, device='cuda:0') tensor(4095.6599, device='cuda:0') tensor(5116.2510, device='cuda:0')
tensor(-145.8661, device='cuda:0') tensor(4161.9165, device='cuda:0') tensor(5351.6670, device='cuda:0')
tensor(-40.3292, device='cuda:0') tensor(1177.1305, device='cuda:0') tensor(1314.6021, device='cuda:0')
tensor(-58.3337, device='cuda:0') tensor(4536.5352, device='cuda:0') tensor(4928.3350, device='cuda:0')
tensor(35.2728, device='cuda:0') tensor(4343.3838, device='cuda:0') tensor(5280.2969, device='cuda:0')
tensor(2.8130, device='cuda:0') tensor(4361.0112, device='cuda:0') tensor(5508.0010, device='cuda:0')
tensor(37.5651, device='cuda:0') tensor(1243.3253, device='cuda:0') tensor(1373.3599, device='cuda:0')
tensor(-25.2279, device='cuda:0') tensor(1490.6547, device='cuda:0') tensor(1566.4670, device='cuda:0')
tensor(7.1009, device='cuda:0') tensor(1408.7480, device='cuda:0') tensor(1670.6154, device='cuda:0')
tensor(-34.9750, device='cuda:0') tensor(1403.9734, device='cuda:0') tensor(1701.3884, device='cuda:0')
tensor(-7.8403, device='cuda:0') tensor(396.0836, device='cuda:0') tensor(424.8773, device='cuda:0')
ir1,ir5,ir10,imedr,imeanr,imAP,cr1,cr5,cr10,cmedr,cmeanr,cmAP,rsum
ir5-rsum,epoch.28.th,22.89,51.07,63.17,5.00,40.16,36.14,22.30,51.10,62.90,5.00,39.20,35.62,273.43
Hi, when I generate my own role graphs, something goes wrong. With the predictor model https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz that you use in semantic_role_labeling.py, I get predictor output like {'verbs': [{'verb': 'talks', 'description': 'a woman talks about a futuristic bicycle design', 'tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}], 'words': ['a', 'woman', 'talks', 'about', 'a', 'futuristic', 'bicycle', 'design']}, where all tags are O. So is there something wrong with the model? I tried other models such as https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.03.24.tar.gz, which is used by the semantic role labeling demo at https://demo.allennlp.org/semantic-role-labeling/MjMyODEwNg==, and it works correctly. Its output is {'verbs': [{'verb': 'is', 'description': 'someone [V: is] blowing a little boys face with a leaf blower', 'tags': ['O', 'B-V', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}, {'verb': 'blowing', 'description': '[ARG0: someone] is [V: blowing] [ARG1: a little boys face] [ARGM-MNR: with a leaf blower]', 'tags': ['B-ARG0', 'O', 'B-V', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'B-ARGM-MNR', 'I-ARGM-MNR', 'I-ARGM-MNR', 'I-ARGM-MNR']}, {'verb': 'face', 'description': 'someone is blowing [ARG0: a little boys] [V: face] with a leaf blower', 'tags': ['O', 'O', 'O', 'B-ARG0', 'I-ARG0', 'I-ARG0', 'B-V', 'O', 'O', 'O', 'O']}], 'words': ['someone', 'is', 'blowing', 'a', 'little', 'boys', 'face', 'with', 'a', 'leaf', 'blower']}.
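For what it's worth, once a predictor returns non-trivial tags (as with the 2020 model above), turning one verb frame's BIO tags into (role, phrase) pairs is straightforward. A small sketch, independent of the repo's actual graph-construction code:

```python
def bio_to_roles(words, tags):
    """Collect (role, phrase) pairs from one verb frame's BIO tags."""
    roles, cur_role, cur_span = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):
            if cur_role:
                roles.append((cur_role, ' '.join(cur_span)))
            cur_role, cur_span = tag[2:], [word]
        elif tag.startswith('I-') and cur_role == tag[2:]:
            cur_span.append(word)
        else:
            if cur_role:
                roles.append((cur_role, ' '.join(cur_span)))
            cur_role, cur_span = None, []
    if cur_role:
        roles.append((cur_role, ' '.join(cur_span)))
    return roles

words = ['someone', 'is', 'blowing', 'a', 'little', 'boys', 'face',
         'with', 'a', 'leaf', 'blower']
tags = ['B-ARG0', 'O', 'B-V', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1',
        'B-ARGM-MNR', 'I-ARGM-MNR', 'I-ARGM-MNR', 'I-ARGM-MNR']
print(bio_to_roles(words, tags))
# [('ARG0', 'someone'), ('V', 'blowing'), ('ARG1', 'a little boys face'),
#  ('ARGM-MNR', 'with a leaf blower')]
```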
Hi cshizhe.
In your paper, the video-to-text retrieval results of all methods on TGIF are much lower than the results reported in the PVSE paper.
Because there is no explanation of these results, I can't understand the discrepancy.
Can you explain it?
I could train your code on TGIF and get the result myself, but I think an explanation from you would be more certain.
Thank you in advance.
The MSVD dataset has no word2int.json and int2word.npy files. Could you give me a new link?
Hi,
I clicked the BaiduNetdisk URL, but the following message appears:
"The content shared by this link cannot be accessed because it may involve copyright infringement, pornography, reactionary or vulgar information!"
The Baidu link for the annotations and pretrained features is gone.
I find that the VATEX dataset you used in HGR is VATEX v1.0, which does not provide annotations for the testing set.
You then randomly split the validation set into two equal parts, with 1,500 videos as the validation set and the other 1,500 videos as the testing set.
I want to follow your dataset partitioning, but I cannot find any split information in this repo.
Could you please provide the 'csv' or 'json' files of the VATEX dataset that contain the partition information?
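In case the exact split files never surface, one reproducible way to divide the 3,000 validation videos into two halves is to fix a random seed, as in the sketch below. This is purely illustrative: the seed, ordering, and output file name are assumptions and will not match the authors' split exactly.

```python
import json
import random

def split_vatex_val(val_video_ids, seed=0):
    """val_video_ids: list of the 3,000 VATEX v1.0 validation video ids."""
    ids = sorted(val_video_ids)
    random.Random(seed).shuffle(ids)           # deterministic shuffle for reproducibility
    split = {'val': ids[:1500], 'test': ids[1500:]}
    json.dump(split, open('vatex_val_test_split.json', 'w'))
    return split
```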