google / localized-narratives
Localized Narratives
Home Page: https://google.github.io/localized-narratives/
License: Apache License 2.0
The downloads for all datasets are broken. Accessing any of the download links returns the following error:
```xml
<Error>
  <Code>AccessDenied</Code>
  <Message>Access denied.</Message>
  <Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.</Details>
</Error>
```
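To reproduce, a minimal check against one of the annotation files; the object path below is my guess based on the file name `coco_val_localized_narratives.jsonl` used elsewhere in these issues, so substitute the actual download link:

```python
import urllib.error
import urllib.request

# Hypothetical object path, inferred from the file name
# coco_val_localized_narratives.jsonl; substitute the real download URL.
URL = ("https://storage.googleapis.com/localized-narratives/"
       "annotations/coco_val_localized_narratives.jsonl")

try:
    with urllib.request.urlopen(urllib.request.Request(URL, method="HEAD")) as resp:
        print("OK:", resp.status)
except urllib.error.HTTPError as err:
    # As of this report, anonymous requests fail with 403 AccessDenied.
    print("Failed:", err.code, err.reason)
```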
Hi, could you explain why `traces` is organized as `List[List[TimePoint]]` instead of `List[TimePoint]`?
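To make the question concrete, here is how I currently handle the nesting. My assumption (not an official answer) is that each inner list is one continuous trace segment, e.g. the pointer going off and back onto the image starts a new segment; the `'x'`/`'y'`/`'t'` field names come from the sample records quoted later in these issues:

```python
from typing import Dict, List

TimedPoint = Dict[str, float]  # {'x': ..., 'y': ..., 't': ...}

def flatten_traces(traces: List[List[TimedPoint]]) -> List[TimedPoint]:
    """Concatenate all trace segments into one list of points.

    Flattening discards the segment boundaries, which is only safe if you
    do not care where the pointer left the image (my assumption).
    """
    return [point for segment in traces for point in segment]

segments = [[{"x": 0.2086, "y": -0.0533, "t": 0.022}],
            [{"x": 0.3, "y": 0.1, "t": 1.5}]]
print(flatten_traces(segments))
```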
Hi @jponttuset
I am unable to download the raw voice recordings of the dataset; the path mentioned in the documentation does not exist.
https://storage.googleapis.com/localized-narratives
When accessing the above URL, I get the following error:
```xml
<Error>
  <Code>NoSuchKey</Code>
  <Message>The specified key does not exist.</Message>
  <Details>No such object: localized-narratives/voice-recordings/</Details>
</Error>
```
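For what it's worth, on Google Cloud Storage a prefix such as `voice-recordings/` is not itself an object, which alone would explain the NoSuchKey error. The per-record `voice_recording` field (e.g. `coco_val/coco_val_381639_89.ogg` in a record quoted later in these issues) suggests fetching individual files instead; a sketch, assuming that bucket layout:

```python
import urllib.request

BASE = "https://storage.googleapis.com/localized-narratives/voice-recordings"

# Taken verbatim from a .jsonl record's 'voice_recording' field.
voice_recording = "coco_val/coco_val_381639_89.ogg"

url = f"{BASE}/{voice_recording}"
urllib.request.urlretrieve(url, voice_recording.split("/")[-1])
```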
Hi, thank you for your impressive work!
I'm currently building my captioning model on this dataset, and I'm wondering whether you could provide your Controlled Image Captioning baseline code. I believe that would enable fairer comparisons on this topic and boost the influence of this dataset.
As the title says, where is the origin of the x and y coordinate system used by the traces in the .jsonl files? I could not find this on your website or in your paper; maybe I didn't look carefully enough. If it isn't documented, I think it would make sense to mention it somewhere.
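Meanwhile, here is how I probe it empirically; this narrows things down but cannot prove where the origin sits:

```python
import json

xs, ys = [], []
with open("coco_val_localized_narratives.jsonl") as f:
    for line in f:
        for segment in json.loads(line)["traces"]:
            for point in segment:
                xs.append(point["x"])
                ys.append(point["y"])

# If coordinates are normalized to the image size, most values should lie
# in [0, 1]; slightly out-of-range values (e.g. y = -0.0533 in a sample
# elsewhere in these issues) would then be pointer positions just off the image.
print("x range:", min(xs), max(xs))
print("y range:", min(ys), max(ys))
```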
Is there a collection or archive of all the voice recordings?
Hi,
While running the demo.py file, I found a data error in the COCO val dataset: in coco_val_localized_narratives.jsonl, line 4414, the record lacks traces data.
```json
{
  "dataset_id": "mscoco_val2017",
  "image_id": "381639",
  "annotator_id": 89,
  "caption": "In the image we can see girl standing and holding a doll in her hand. These are the road cone. There are even other people who are getting into airplane, there is a building. This is a tree and sky.",
  "timed_caption": [
    {"utterance": "In the", "start_time": 0, "end_time": 0},
    {"utterance": "image", "start_time": 0, "end_time": 4.4},
    {"utterance": "we", "start_time": 4.4, "end_time": 5},
    {"utterance": "can", "start_time": 5, "end_time": 5.3},
    {"utterance": "see", "start_time": 5.3, "end_time": 5.5},
    {"utterance": "girl", "start_time": 5.5, "end_time": 6.3},
    {"utterance": "standing", "start_time": 6.3, "end_time": 6.9},
    {"utterance": "and", "start_time": 6.9, "end_time": 7.6},
    {"utterance": "holding", "start_time": 7.6, "end_time": 8.1},
    {"utterance": "a", "start_time": 8.1, "end_time": 8.2},
    {"utterance": "doll", "start_time": 8.2, "end_time": 8.5},
    {"utterance": "in", "start_time": 8.5, "end_time": 9.1},
    {"utterance": "her", "start_time": 9.1, "end_time": 9.3},
    {"utterance": "hand.", "start_time": 9.3, "end_time": 9.5},
    {"utterance": "These", "start_time": 9.5, "end_time": 10.5},
    {"utterance": "are", "start_time": 10.5, "end_time": 10.7},
    {"utterance": "the", "start_time": 10.7, "end_time": 10.9},
    {"utterance": "road", "start_time": 10.9, "end_time": 11.2},
    {"utterance": "cone.", "start_time": 11.2, "end_time": 11.6},
    {"utterance": "There", "start_time": 11.6, "end_time": 12.2},
    {"utterance": "are", "start_time": 12.2, "end_time": 12.3},
    {"utterance": "even", "start_time": 12.3, "end_time": 12.8},
    {"utterance": "other", "start_time": 12.8, "end_time": 13},
    {"utterance": "people", "start_time": 13, "end_time": 13.5},
    {"utterance": "who", "start_time": 13.5, "end_time": 13.5},
    {"utterance": "are", "start_time": 13.5, "end_time": 14},
    {"utterance": "getting", "start_time": 14, "end_time": 14.5},
    {"utterance": "into", "start_time": 14.5, "end_time": 15.2},
    {"utterance": "airplane,", "start_time": 15.2, "end_time": 15.8},
    {"utterance": "there", "start_time": 15.8, "end_time": 16.7},
    {"utterance": "is", "start_time": 16.7, "end_time": 16.9},
    {"utterance": "a", "start_time": 16.9, "end_time": 17.1},
    {"utterance": "building.", "start_time": 17.1, "end_time": 17.6},
    {"utterance": "This", "start_time": 17.6, "end_time": 18.5},
    {"utterance": "is", "start_time": 18.5, "end_time": 18.8},
    {"utterance": "a", "start_time": 18.8, "end_time": 18.9},
    {"utterance": "tree", "start_time": 18.9, "end_time": 19.4},
    {"utterance": "and", "start_time": 19.4, "end_time": 19.9},
    {"utterance": "sky.", "start_time": 19.9, "end_time": 20.2}
  ],
  "traces": [],
  "voice_recording": "coco_val/coco_val_381639_89.ogg"
}
```
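A quick way to check whether this is an isolated case is to scan the file for records whose traces list is empty:

```python
import json

with open("coco_val_localized_narratives.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        if not record["traces"]:
            print(line_number, record["image_id"], record["annotator_id"])
```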
I quote from the paper: "Note that µ assigns each a_i to exactly one m_j, but m_j can match to zero or multiple words in a", and refer to the formal definition of how the start_time and end_time are derived for a word m_j. It stands to reason that every individual word should have its own time span. However, in the sample data
```
{ dataset_id: 'mscoco_val2017', image_id: '137576', annotator_id: 93, caption: 'In this image there are group of cows standing and eating th...', timed_caption: [{'utterance': 'In this', 'start_time': 0.0, 'end_time': 0.4}, ...], traces: [[{'x': 0.2086, 'y': -0.0533, 't': 0.022}, ...], ...], voice_recording: 'coco_val/coco_val_137576_93.ogg' }
```
the two words 'In this' share the same time window, which contradicts the paper's description of how start_time and end_time are assigned to individual words.
Yet it is correct in other parts of the dataset, such as the following:
```
{ dataset_id: ADE20k, image_id: ADE_val_00000175, annotator_id: 125, caption: In this image on the left side I can see a bed and a window...., timed_caption: [{'utterance': 'In', 'start_time': 0.0, 'end_time': 0.0}, {'utterance': 'this', 'start_time': 0.0, 'end_time': 0.8}, ...], traces: [[{'x': 0.6408, 'y': 0.1371, 't': 0.013}, ...], ...], voice_recording: ade20k_validation/ade20k_validation_ADE_val_00000175_125.ogg }
```
where 'In' and 'this' each have their own start_time and end_time.
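To quantify how often this happens, one can count the timed_caption entries whose utterance spans more than one word (splitting on whitespace is an approximation on my part):

```python
import json

multi, total = 0, 0
with open("coco_val_localized_narratives.jsonl") as f:
    for line in f:
        for entry in json.loads(line)["timed_caption"]:
            total += 1
            if len(entry["utterance"].split()) > 1:
                multi += 1  # e.g. {'utterance': 'In this', ...}
print(f"{multi} of {total} utterances span multiple words")
```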
I would appreciate it if you could shed some light on this discrepancy. Apologies if this is a mistake on my part, and thank you for reading my message.
Yours Sincerely,
Gordon
Hi,
would you be willing to make the annotation codebase available? I would like to collect similar data for my work.
Thanks
Hi, I have a question about the evaluation process.
In the validation set, an image may have multiple traces according to your annotations. Which trace is taken as input? If we treat each trace as an individual item to evaluate, we get multiple hypotheses for a single image ID, which is inconsistent with the default setting of the MS COCO caption evaluation tool.
A more detailed explanation of how to reproduce the reported results would be a great help.
Many thanks for considering my request.
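To make the question concrete, the sketch below shows one possible way (entirely assumed on my part, not the paper's documented protocol) to collapse multiple annotations into a single hypothesis source per image ID before running the MS COCO tool:

```python
import json

records_per_image = {}
with open("coco_val_localized_narratives.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Assumed policy: keep only the first annotation per image_id.
        # The paper may use a different, per-trace protocol.
        records_per_image.setdefault(record["image_id"], record)

print(len(records_per_image), "unique images")
```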