Comments (4)
Note that the optimization objective is not BLEU score:
https://github.com/kyunghyuncho/dl4mt-material/blob/master/session3/nmt.py#L649-L659
This paper discusses some issues with the discrepancy between training and
inference in this kind of seq2seq model http://arxiv.org/pdf/1506.03099.pdf
On Thu, Jan 28, 2016 at 11:39 AM, Jencir Lee [email protected] wrote:

1. I'm training session3/nmt.py with attention on the Europarl corpus, with a 5000-term vocabulary, 250-dimensional word vectors, and a 500-dimensional internal representation. A full epoch takes a couple of days on an AWS GPU instance (Nvidia K40) with 4 GB of GPU memory. I'm just wondering if there's any known more basic parallel corpus (e.g. understandable by a 10-year-old) for training.
2. The BLEU metric could make it anodyne for the translation to omit some pivotal words, e.g. if the correct translation is "I believe, that xxxx" and the machine translation omitted "believe". Is there any idea for an MT approach that could better conserve the structural/compositional information?

Reply to this email directly or view it on GitHub: #37.
from dl4mt-tutorial.
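To make the distinction concrete, here is a minimal, hypothetical sketch (not code from this repo) of the per-token cross-entropy objective that seq2seq models of this kind typically minimize during training. BLEU never appears in it, which is one source of the training/inference discrepancy the linked paper discusses:

```python
import numpy as np

def cross_entropy_loss(probs, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    probs: array of shape (timesteps, vocab_size); each row holds the
           model's predicted next-token distribution at that step.
    target_ids: list of reference token ids, one per timestep.
    """
    # Pick out the probability the model assigned to each reference token,
    # then average the negative log-probabilities.
    nll = -np.log([probs[t, tok] for t, tok in enumerate(target_ids)])
    return nll.mean()

# Toy example: a 4-token vocabulary and two decoding steps.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.6, 0.1, 0.1]])
target = [0, 1]
loss = cross_entropy_loss(probs, target)  # mean of -log(0.7) and -log(0.6)
```

Note that the loss rewards putting probability mass on the exact reference token at each step; two translations with identical BLEU can have very different cross-entropy, and vice versa.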
Also this one http://arxiv.org/pdf/1511.06456.pdf
As always, thanks a lot @chrishokamp. I may further add these two:
http://arxiv.org/abs/1512.02433
http://research.microsoft.com/apps/pubs/default.aspx?id=217163
But note that none of these ideas is implemented in this repo.
@jli05 For your first question, maybe you can try using TED talks (check IWSLT) or OpenSubtitles (OPUS); I'm not sure, though.
Sorry, I just meant that the metric generally gives equal emphasis to each term/n-gram (apart from the NIST score), so it doesn't reflect well that some errors (e.g. missing the verb of the entire sentence) are graver than others. Is it possible for the translation engine to conserve some structural information of the source text, or at least for the metric to reflect this? (I think S. Bowman did a study concluding that LSTMs actually preserve some tree-like information.)
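The concern about BLEU's equal weighting can be demonstrated with a small sketch of modified n-gram precision, the core quantity inside BLEU. The sentences and helper below are illustrative, not taken from the repo or any corpus:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision: fraction of candidate n-grams that also
    occur in the reference, with counts clipped to the reference counts."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = Counter()
    for g in cand:
        if clipped[g] < ref[g]:  # clip matches at the reference count
            clipped[g] += 1
    return sum(clipped.values()) / len(cand)

ref = "i believe that this is true".split()
drop_verb = "i that this is true".split()        # omits the pivotal "believe"
drop_filler = "i believe that this true".split() # omits the minor "is"

p1_verb = ngram_precision(drop_verb, ref, 1)     # unigram precision
p1_filler = ngram_precision(drop_filler, ref, 1)
p2_verb = ngram_precision(drop_verb, ref, 2)     # bigram precision
p2_filler = ngram_precision(drop_filler, ref, 2)
```

Here both truncated candidates score identically at the unigram (1.0) and bigram (0.75) levels, even though one dropped the main verb: the n-gram counting is blind to which word went missing, which is exactly the insensitivity described above.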
Thanks for the corpus suggestions. A side question: when someone sets out to compile a corpus for their NLP training tasks, how much corpus is enough? Is there any practical rule of thumb? I ask because NLP is different from images; if we were working on an image set, we could tell ourselves: "OK, we've got 20 giraffe pictures. We're roughly fine."