google / localized-narratives
Localized Narratives
Home Page: https://google.github.io/localized-narratives/
License: Apache License 2.0
The downloads for all datasets are broken. Accessing any of the download links returns the following error:
```xml
<Error>
  <Code>AccessDenied</Code>
  <Message>Access denied.</Message>
  <Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.</Details>
</Error>
```
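To reproduce, a minimal check against one of the annotation files; the object path below is my guess based on the file name `coco_val_localized_narratives.jsonl` used elsewhere in these issues, so substitute the actual download link:

```python
import urllib.error
import urllib.request

# Hypothetical object path, inferred from the file name
# coco_val_localized_narratives.jsonl; substitute the real download URL.
URL = ("https://storage.googleapis.com/localized-narratives/"
       "annotations/coco_val_localized_narratives.jsonl")

try:
    with urllib.request.urlopen(urllib.request.Request(URL, method="HEAD")) as resp:
        print("OK:", resp.status)
except urllib.error.HTTPError as err:
    # As of this report, anonymous requests fail with 403 AccessDenied.
    print("Failed:", err.code, err.reason)
```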
Hi, could you explain why `traces` is organized as `List[List[TimePoint]]` instead of `List[TimePoint]`?
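To make the question concrete, here is how I currently handle the nesting. My assumption (not an official answer) is that each inner list is one continuous trace segment, e.g. the pointer going off and back onto the image starts a new segment; the `'x'`/`'y'`/`'t'` field names come from the sample records quoted later in these issues:

```python
from typing import Dict, List

TimedPoint = Dict[str, float]  # {'x': ..., 'y': ..., 't': ...}

def flatten_traces(traces: List[List[TimedPoint]]) -> List[TimedPoint]:
    """Concatenate all trace segments into one list of points.

    Flattening discards the segment boundaries, which is only safe if you
    do not care where the pointer left the image (my assumption).
    """
    return [point for segment in traces for point in segment]

segments = [[{"x": 0.2086, "y": -0.0533, "t": 0.022}],
            [{"x": 0.3, "y": 0.1, "t": 1.5}]]
print(flatten_traces(segments))
```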
Hi @jponttuset
I am unable to download the raw voice recordings of the dataset; the path mentioned in the documentation does not exist.
https://storage.googleapis.com/localized-narratives
When accessing the above URL, I get the following error:
```xml
<Error>
  <Code>NoSuchKey</Code>
  <Message>The specified key does not exist.</Message>
  <Details>No such object: localized-narratives/voice-recordings/</Details>
</Error>
```
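For what it's worth, on Google Cloud Storage a prefix such as `voice-recordings/` is not itself an object, which alone would explain the NoSuchKey error. The per-record `voice_recording` field (e.g. `coco_val/coco_val_381639_89.ogg` in a record quoted later in these issues) suggests fetching individual files instead; a sketch, assuming that bucket layout:

```python
import urllib.request

BASE = "https://storage.googleapis.com/localized-narratives/voice-recordings"

# Taken verbatim from a .jsonl record's 'voice_recording' field.
voice_recording = "coco_val/coco_val_381639_89.ogg"

url = f"{BASE}/{voice_recording}"
urllib.request.urlretrieve(url, voice_recording.split("/")[-1])
```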
Hi, thank you for your impressive work!
I'm currently building my captioning model on this dataset, and I'm wondering whether you could provide your Controlled Image Captioning baseline code. I believe that would enable fairer comparisons on this topic and boost the influence of this dataset.
As the title says, where is the origin of the x and y coordinate system used by the traces in the .jsonl files? I could not find this on your website or in your paper; maybe I didn't look carefully enough. If it isn't documented, I think it would make sense to mention it somewhere.
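Meanwhile, here is how I probe it empirically; this narrows things down but cannot prove where the origin sits:

```python
import json

xs, ys = [], []
with open("coco_val_localized_narratives.jsonl") as f:
    for line in f:
        for segment in json.loads(line)["traces"]:
            for point in segment:
                xs.append(point["x"])
                ys.append(point["y"])

# If coordinates are normalized to the image size, most values should lie
# in [0, 1]; slightly out-of-range values (e.g. y = -0.0533 in a sample
# elsewhere in these issues) would then be pointer positions just off the image.
print("x range:", min(xs), max(xs))
print("y range:", min(ys), max(ys))
```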
Is there a collection or archive of all the voice recordings?
Hi,
While running the demo.py file, I found a data error in the COCO val dataset: in coco_val_localized_narratives.jsonl, line 4414, the record lacks traces data.
```json
{
  "dataset_id": "mscoco_val2017",
  "image_id": "381639",
  "annotator_id": 89,
  "caption": "In the image we can see girl standing and holding a doll in her hand. These are the road cone. There are even other people who are getting into airplane, there is a building. This is a tree and sky.",
  "timed_caption": [
    {"utterance": "In the", "start_time": 0, "end_time": 0},
    {"utterance": "image", "start_time": 0, "end_time": 4.4},
    {"utterance": "we", "start_time": 4.4, "end_time": 5},
    {"utterance": "can", "start_time": 5, "end_time": 5.3},
    {"utterance": "see", "start_time": 5.3, "end_time": 5.5},
    {"utterance": "girl", "start_time": 5.5, "end_time": 6.3},
    {"utterance": "standing", "start_time": 6.3, "end_time": 6.9},
    {"utterance": "and", "start_time": 6.9, "end_time": 7.6},
    {"utterance": "holding", "start_time": 7.6, "end_time": 8.1},
    {"utterance": "a", "start_time": 8.1, "end_time": 8.2},
    {"utterance": "doll", "start_time": 8.2, "end_time": 8.5},
    {"utterance": "in", "start_time": 8.5, "end_time": 9.1},
    {"utterance": "her", "start_time": 9.1, "end_time": 9.3},
    {"utterance": "hand.", "start_time": 9.3, "end_time": 9.5},
    {"utterance": "These", "start_time": 9.5, "end_time": 10.5},
    {"utterance": "are", "start_time": 10.5, "end_time": 10.7},
    {"utterance": "the", "start_time": 10.7, "end_time": 10.9},
    {"utterance": "road", "start_time": 10.9, "end_time": 11.2},
    {"utterance": "cone.", "start_time": 11.2, "end_time": 11.6},
    {"utterance": "There", "start_time": 11.6, "end_time": 12.2},
    {"utterance": "are", "start_time": 12.2, "end_time": 12.3},
    {"utterance": "even", "start_time": 12.3, "end_time": 12.8},
    {"utterance": "other", "start_time": 12.8, "end_time": 13},
    {"utterance": "people", "start_time": 13, "end_time": 13.5},
    {"utterance": "who", "start_time": 13.5, "end_time": 13.5},
    {"utterance": "are", "start_time": 13.5, "end_time": 14},
    {"utterance": "getting", "start_time": 14, "end_time": 14.5},
    {"utterance": "into", "start_time": 14.5, "end_time": 15.2},
    {"utterance": "airplane,", "start_time": 15.2, "end_time": 15.8},
    {"utterance": "there", "start_time": 15.8, "end_time": 16.7},
    {"utterance": "is", "start_time": 16.7, "end_time": 16.9},
    {"utterance": "a", "start_time": 16.9, "end_time": 17.1},
    {"utterance": "building.", "start_time": 17.1, "end_time": 17.6},
    {"utterance": "This", "start_time": 17.6, "end_time": 18.5},
    {"utterance": "is", "start_time": 18.5, "end_time": 18.8},
    {"utterance": "a", "start_time": 18.8, "end_time": 18.9},
    {"utterance": "tree", "start_time": 18.9, "end_time": 19.4},
    {"utterance": "and", "start_time": 19.4, "end_time": 19.9},
    {"utterance": "sky.", "start_time": 19.9, "end_time": 20.2}
  ],
  "traces": [],
  "voice_recording": "coco_val/coco_val_381639_89.ogg"
}
```
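A quick way to check whether this is an isolated case is to scan the file for records whose traces list is empty:

```python
import json

with open("coco_val_localized_narratives.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        if not record["traces"]:
            print(line_number, record["image_id"], record["annotator_id"])
```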
I quote from the paper: "Note that µ assigns each a_i to exactly one m_j, but m_j can match to zero or multiple words in a", and refer to the formal definition of how the start_time and end_time are derived for a word m_j. It stands to reason that every individual word should have its own time span. However, in the sample data
```
{ dataset_id: 'mscoco_val2017', image_id: '137576', annotator_id: 93, caption: 'In this image there are group of cows standing and eating th...', timed_caption: [{'utterance': 'In this', 'start_time': 0.0, 'end_time': 0.4}, ...], traces: [[{'x': 0.2086, 'y': -0.0533, 't': 0.022}, ...], ...], voice_recording: 'coco_val/coco_val_137576_93.ogg' }
```
the two words 'In this' share the same time window, which contradicts the paper's description of how start_time and end_time are assigned to individual words.
Yet it is correct in other parts of the dataset, such as the following:
```
{ dataset_id: ADE20k, image_id: ADE_val_00000175, annotator_id: 125, caption: In this image on the left side I can see a bed and a window...., timed_caption: [{'utterance': 'In', 'start_time': 0.0, 'end_time': 0.0}, {'utterance': 'this', 'start_time': 0.0, 'end_time': 0.8}, ...], traces: [[{'x': 0.6408, 'y': 0.1371, 't': 0.013}, ...], ...], voice_recording: ade20k_validation/ade20k_validation_ADE_val_00000175_125.ogg }
```
where 'In' and 'this' each have their own start_time and end_time.
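To quantify how often this happens, one can count the timed_caption entries whose utterance spans more than one word (splitting on whitespace is an approximation on my part):

```python
import json

multi, total = 0, 0
with open("coco_val_localized_narratives.jsonl") as f:
    for line in f:
        for entry in json.loads(line)["timed_caption"]:
            total += 1
            if len(entry["utterance"].split()) > 1:
                multi += 1  # e.g. {'utterance': 'In this', ...}
print(f"{multi} of {total} utterances span multiple words")
```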
I would appreciate it if you could shed some light on this discrepancy. Apologies if this is a mistake on my part, and thank you for reading my message.
Yours Sincerely,
Gordon
Hi,
would you be willing to make the annotation codebase available? I would like to collect similar data for my work.
Thanks
Hi, I have a question about the evaluation process.
In the validation set, an image may have multiple traces according to your annotations. Which trace is taken as input? If we treat each trace as an individual item to evaluate, we get multiple hypotheses for a single image ID, which is inconsistent with the default setting of the MS COCO caption evaluation tool.
A more detailed explanation of how to reproduce the reported results would be a great help.
Many thanks for considering my request.
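To make the question concrete, the sketch below shows one possible way (entirely assumed on my part, not the paper's documented protocol) to collapse multiple annotations into a single hypothesis source per image ID before running the MS COCO tool:

```python
import json

records_per_image = {}
with open("coco_val_localized_narratives.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Assumed policy: keep only the first annotation per image_id.
        # The paper may use a different, per-trace protocol.
        records_per_image.setdefault(record["image_id"], record)

print(len(records_per_image), "unique images")
```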