mrasp2's People
Forkers
zhajiahe yepg huiyangzhou hongwen-sun trendingtechnology sefazeng mao-ku gdls lileicc s-gege jianwenjun aidrug musemamba sunilitggu smithol acerkoo shimu007 ishine abdelrahman-abouelenin cshanbo szc-coder jalork aliang-nlp rajansaini691 dsj96 happyhappypotter
mrasp2's Issues
Please explain the meaning of the evaluation parameters
During evaluation, the following command is executed. Could you explain in detail what its parameters mean?
export NUM_GPU=4 && bash eval.sh ${test_config} ${model_config} ${spm_model} ${bleutype}
Release of synonym dictionary
Hi, this is a really great paper. In the paper, you said you would release the synonym dictionary. May I ask when you will release it? In addition, is it a multilingual synonym dictionary? Do you have a monolingual synonym dictionary, e.g. one for English only?
dataset
What does the "unidirectional corpus" mentioned in the paper mean? My understanding is that a bidirectional (parallel) corpus consists of sentence pairs with the same meaning, e.g. <English, Chinese>, but a unidirectional (monolingual) corpus has no corresponding labels, so how are the cross-entropy loss and the contrastive loss computed for it? Also, in the code the data takes the form of <String, Coding> pairs, which differs from my understanding; I have thought about it for a long time and cannot figure out how the data is processed into such pairs.
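Note: as context for this question, the following is a rough, illustrative sketch of how a sentence-level contrastive loss of this kind is typically computed from encoder outputs (average-pooled sentence embeddings, cosine similarity with a temperature, in-batch negatives). It is not the repository's exact code; shapes follow fairseq's T x B x C encoder-output convention, and the function names are placeholders.

import torch.nn as nn

def sentence_embedding(encoder_out, src_tokens, pad_idx):
    # encoder_out: T x B x C hidden states; src_tokens: B x T token ids
    mask = src_tokens.ne(pad_idx).transpose(0, 1).unsqueeze(-1).float()  # T x B x 1
    return (encoder_out * mask).sum(dim=0) / mask.sum(dim=0)             # B x C

def contrastive_loss(src_emb, tgt_emb, temperature=0.1):
    # matching (source, target) pairs are positives; other sentences in the batch are negatives
    sim = nn.CosineSimilarity(dim=-1)(src_emb.unsqueeze(1), tgt_emb.unsqueeze(0)) / temperature  # B x B
    return -nn.LogSoftmax(dim=-1)(sim).diag().sum()  # summed; fairseq criterions normalize later

A loss of this shape needs a positive pair for every anchor, which is why parallel sentence pairs are required for the contrastive term; how monolingual data enters training is a separate question for the authors.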
About swap sample
Hi Xiao,
I have a question about the swap_sample function in label_smoothed_cross_entropy_with_contrastive.py
Here, after swapping the sample, src_tokens are the same as the original target tokens.
However, the padding positions for source and target are different. src uses left padding while tgt uses right padding (see details below)
https://github.com/facebookresearch/fairseq/blob/a0ceabc287e26f64517fadb13a54c83b71e8e469/fairseq/tasks/translation.py#L200
Thus, why not apply left padding to the new source tokens (the old target tokens), which originally use right padding?
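For reference, re-padding the swapped source on the left could be done with a small helper like the sketch below. This is illustrative only and not code from this repository; pad_idx stands for the dictionary's padding index and tokens for a B x T LongTensor.

def right_pad_to_left_pad(tokens, pad_idx):
    # tokens: B x T, padded on the right; returns the same content padded on the left
    out = tokens.new_full(tokens.size(), pad_idx)
    for i, row in enumerate(tokens):
        non_pad = row[row.ne(pad_idx)]
        out[i, tokens.size(1) - non_pad.numel():] = non_pad
    return out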
Gather all of the outputs from different nodes and then compute contrastive loss
Hi,
this is a really great paper. I have a question about the calculation of the contrastive loss.
In the paper, you "use 8 × 4 NVIDIA V100 with update frequency 50 to train the models and each batch contains about 3 million tokens". Do you compute the contrastive loss within a mini-batch on a GPU or on all GPUs?
For the conventional way of computing a contrastive loss, we need to gather all outputs from the models on different GPUs and then compute it. But in your code, I cannot find the code for that. So do you compute it only with the outputs on a single GPU?
If I misunderstand your code, could you point out where you gather the outputs? Normally we use something like "torch.gather_all()".
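For context, a common pattern for gathering negatives across GPUs is torch.distributed.all_gather with the local shard re-inserted so that gradients still flow through it. The sketch below is illustrative and not the authors' code; whether mRASP2 actually gathers across ranks is exactly the question above.

import torch
import torch.distributed as dist

def gather_features(local_feats):
    # local_feats: B x C sentence embeddings computed on this rank
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_feats) for _ in range(world_size)]
    dist.all_gather(gathered, local_feats)   # all_gather itself does not propagate gradients
    gathered[dist.get_rank()] = local_feats  # re-insert the local tensor so its gradients survive
    return torch.cat(gathered, dim=0)        # (world_size * B) x C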
Question on WMT16 en-ro
On the WMT16 en->ro benchmark, the result reported on this website (28.7) is quite different from that reported in your paper (38.0). Is it possible for you to release your BPE-tokenized WMT16 en-ro test set? I am trying to reproduce your results on this benchmark but cannot achieve comparable performance.
Thanks a lot!
Broken links to datasets
Hi, I find that some links to the datasets seem to be broken and report the error "upstream server error". Are new links available? Thanks!
Where can I get trained models?
Hi, I'm very interested in your work and want to do additional experiments with the model.
Where can I get the trained one?
Thank you for your great work!
src tokens fed twice into the encoder?
Hi,
In the criterion script that constructs the contrastive loss, one line does:
[src tokens -> encoder] -> decoder -> output
and another line does:
src tokens -> encoder -> encoder output, which is the same as the part in [ ] above.
It seems the src tokens are fed into the encoder twice. Although the loss computation will still be correct, won't this reduce training efficiency? Or is there anything I missed?
Thank you in advance.
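If the concern is the duplicated encoder pass, one possible rearrangement is to run the encoder once and reuse its output for both the decoder and the contrastive term. The sketch below assumes the standard fairseq encoder/decoder interface and is not the repository's actual code.

def encode_once(model, sample):
    # run the encoder a single time and reuse its output for both loss terms
    net_input = sample["net_input"]
    encoder_out = model.encoder(net_input["src_tokens"], src_lengths=net_input["src_lengths"])
    decoder_out = model.decoder(net_input["prev_output_tokens"], encoder_out=encoder_out)
    # decoder_out feeds the label-smoothed cross-entropy;
    # encoder_out (pooled into sentence embeddings) feeds the contrastive term
    return encoder_out, decoder_out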
How to make inference with the trained model
How can I run inference with the trained model? Can you provide an example script?
Inference Error
Hey, this is really great work, but I ran into a problem when using the model for inference.
You have released three models: 6e6d-no-mono, 12e12d-no-mono and 12e12d. I tried to use 12e12d-no-mono and 12e12d to translate Hindi to English. However, I ran into this problem: sometimes 12e12d cannot decode the tokens correctly, while 12e12d-no-mono decodes them correctly. The following are my test samples and the tokens predicted by each model:
model: 12e12d
S-6 LANG_TOK_HI इस समय आ@@ ठ अं@@ को के साथ इ@@ ट@@ ली पू@@ ल C में ती@@ स@@ रे नं@@ बर पर हैं और इ@@ ट@@ ली को 29 सि@@ तं@@ बर को स@@ ्@@ कॉ@@ ट@@ ल@@ ै@@ ंड के खि@@ ला@@ फ@@ ़ दूस@@ रे मै@@ च में कड@@ ़@@ ी ट@@ क@@ ्@@ कर मि@@ ली ।
H-6 -0.6864292621612549 LANG_TOK_EN Ital@@ y is now on the th@@ ir@@ d spo@@ t in Po@@ ol C with eig@@ ht points and Ital@@ y fo@@ und a tie on September 29 against Sc@@ ot@@ land in a sec@@ ond mat@@ ch .
S-7 LANG_TOK_HI न@@ ्@@ यू@@ ज@@ ़@@ ी@@ ल@@ ै@@ ंड ग@@ ्@@ रु@@ प में प@@ ्@@ रथम श@@ ्@@ रे@@ णी पर , स@@ ्@@ कॉ@@ ट@@ ल@@ ै@@ ंड से 10 प@@ ॉ@@ इं@@ ट से आ@@ गे रहा ।
H-7 -0.6589236855506897 ् न ् यू@@ जी@@ ल@@ ै@@ ंड सम@@ ू@@ ह में पहले श ् रे@@ णी पर , स ् कॉ@@ ट@@ ल@@ ै@@ ंड से 10 प@@ ॉ@@ इं@@ ट से आ@@ गे रहा ।
model: 12e12d-no-mono
S-6 LANG_TOK_HI इस समय आ@@ ठ अं@@ को के साथ इ@@ ट@@ ली पू@@ ल C में ती@@ स@@ रे नं@@ बर पर हैं और इ@@ ट@@ ली को 29 सि@@ तं@@ बर को स@@ ्@@ कॉ@@ ट@@ ल@@ ै@@ ंड के खि@@ ला@@ फ@@ ़ दूस@@ रे मै@@ च में कड@@ ़@@ ी ट@@ क@@ ्@@ कर मि@@ ली ।
H-6 -0.5951337218284607 LANG_TOK_EN Ital@@ y is cur@@ rent@@ ly th@@ ir@@ d in Po@@ ol C with eig@@ ht points and scor@@ ed a tie against Sc@@ ot@@ land in the sec@@ ond mat@@ ch on September 29 .
S-7 LANG_TOK_HI न@@ ्@@ यू@@ ज@@ ़@@ ी@@ ल@@ ै@@ ंड ग@@ ्@@ रु@@ प में प@@ ्@@ रथम श@@ ्@@ रे@@ णी पर , स@@ ्@@ कॉ@@ ट@@ ल@@ ै@@ ंड से 10 प@@ ॉ@@ इं@@ ट से आ@@ गे रहा ।
H-7 -0.6146384477615356 LANG_TOK_EN In the New Ze@@ al@@ and gro@@ up , it was 10 points a@@ head of Sc@@ ot@@ land in the first clas@@ s .
The following is my script:
model: 12e12d
fairseq-generate ./test_data/bin \
    --user-dir ./mcolt \
    -s hi \
    -t en \
    --path ./model/12e12d_last.pt \
    --max-tokens 1024 \
    --task translation_w_langtok \
    --lang-prefix-tok "LANG_TOK_"$(echo "en" | tr '[a-z]' '[A-Z]') \
    --max-source-positions 1024 \
    --max-target-positions 1024 \
    --nbest 1 | grep -E '[S|H|P|T]-[0-9]+' > ./test_data/trans_res/en_12e12d_last.txt
model: 12e12d-no-mono
fairseq-generate ./test_data/bin \
    --user-dir ./mcolt \
    -s hi \
    -t en \
    --path ./model/12e12d_no_mono.pt \
    --max-tokens 1024 \
    --task translation_w_langtok \
    --lang-prefix-tok "LANG_TOK_"$(echo "en" | tr '[a-z]' '[A-Z]') \
    --max-source-positions 1024 \
    --max-target-positions 1024 \
    --nbest 1 | grep -E '[S|H|P|T]-[0-9]+' > ./test_data/trans_res/en_12e12d_no_mono.txt
It can be seen that the tokens the two models predict for H-7 are completely inconsistent. The first position should be LANG_TOK_EN, but the model decodes it as ्. Moreover, the tokens after LANG_TOK_ are neither fully source-language tokens nor target-language tokens. In my test set there are other sentences that are decoded the same way, and their first token is also ्.
Why does this happen? Did I not pass the parameters expected by 12e12d correctly?
An error occurred during model inference
I am using the new version. An error occurred during the model inference stage. The model is 12e12d_no_mono.pt, the data is the binarized test set you provided, and the other configuration files were set up according to examples/configs/biginfer.
File "XXX/lib/python3.7/site-packages/fairseq/checkpoint_utils.py", line 420, in _upgrade_state_dict
state["args"].task = "translation"
AttributeError: 'NoneType' object has no attribute 'task'
Where is self.compute_loss() defined?
In the Python file label_smoothed_cross_entropy_with_contrastive.py:
def forward(self, model, sample, reduce=True):
net_output = model(**sample["net_input"])
loss, nll_loss = self.compute_loss(model, net_output, sample, reduce=reduce)
....
Where is self.compute_loss() defined?
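For anyone else wondering: compute_loss does not appear to be defined in this file; it looks like it is inherited from fairseq's label-smoothed cross-entropy criterion (fairseq/criterions/label_smoothed_cross_entropy.py). Roughly, with an illustrative class name rather than the repo's exact one:

from fairseq.criterions.label_smoothed_cross_entropy import LabelSmoothedCrossEntropyCriterion

class ContrastiveCriterionSketch(LabelSmoothedCrossEntropyCriterion):
    def forward(self, model, sample, reduce=True):
        net_output = model(**sample["net_input"])
        # self.compute_loss resolves to the parent implementation in
        # fairseq/criterions/label_smoothed_cross_entropy.py
        loss, nll_loss = self.compute_loss(model, net_output, sample, reduce=reduce)
        ...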
Dataset for evaluation
Hi there,
I have been trying to use the mRASP2 model for evaluation and have run into a couple of issues.
- The sample yaml file for evaluation has the following format:
data_testset_1:
direction: en2de
name: wmt14
path: data/binarized/en_de/en2de/wmt14
ref: data/dev/en2de/wmt14
What are path and ref referring to? How do we get the binarized version? Is there a script I can follow, or a link to the dataset that was used?
Additionally, what is the fairseq model used while evaluating?
fairseq-train config question
The script in new_impl cannot be downloaded
For example, running the following command returns a 403 error:
wget -c http://sf3-ttcdn-tos.pstatp.com/obj/nlp-opensource/acl2021/mrasp2/parallel_pub100_bin/download.sh
Could you give more information about how to run the code?
Thank you for your great work!
Could you give more information about how to run the code, such as the data format and the scripts for training and inference?
Thank you very much!
loss average
Hi, in this line,
why is it sum rather than mean? Does the fairseq library automatically average within a batch? Sorry, I am not familiar with this framework. I also notice that the reduce function is sum in compute_loss: https://github.com/pytorch/fairseq/blob/14c5bd027f04aae9dbb32f1bd7b34591b61af97f/fairseq/criterions/label_smoothed_cross_entropy.py#L46
Also, ntokens/nsentences means the average number of tokens per sentence within a batch, right?
Could you please share the loss values in the early training stage? According to my empirical experiment, even without multiplying contrastive_loss by ntokens/nsentences, it is already of the same order of magnitude. Thanks so much!
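In case it helps: fairseq criterions conventionally return summed losses, and the averaging happens afterwards in the criterion's reduce_metrics, which divides the summed logging outputs by sample_size / ntokens aggregated across workers. Below is a simplified sketch of that convention, based on fairseq's stock label-smoothed cross-entropy criterion rather than this repo's exact code (and written as a plain function instead of the original classmethod).

import math
from fairseq import metrics

def reduce_metrics(logging_outputs):
    # criterions log summed values; averaging happens here, over the whole distributed batch
    loss_sum = sum(log.get("loss", 0) for log in logging_outputs)
    nll_loss_sum = sum(log.get("nll_loss", 0) for log in logging_outputs)
    ntokens = sum(log.get("ntokens", 0) for log in logging_outputs)
    sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
    metrics.log_scalar("loss", loss_sum / sample_size / math.log(2), sample_size, round=3)
    metrics.log_scalar("nll_loss", nll_loss_sum / ntokens / math.log(2), ntokens, round=3)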
Fine-tuning mRASP2 in the zh-en and en-zh directions
Hello, what learning rate and optimizer settings did you use for the WMT17 zh-en and en-zh directions? Did you fine-tune using only the WMT17 Chinese-English parallel corpus? How many samples does the WMT17 Chinese-English training data contain in total, and roughly how many epochs did it take to converge?
Looking forward to your reply!
How do I get eval.sh to run?
How do I get eval.sh to run? Please advise @PANXiao1994
Project dependencies may have API risk issues
Hi, in mRASP2, inappropriate dependency versioning constraints can cause risks.
Below are the dependencies and version constraints that the project is using:
subword-nmt
sacrebleu
sacremoses
kytea
six
The version constraint == introduces the risk of dependency conflicts because the dependency scope is too strict.
The version constraints "no upper bound" and * introduce the risk of missing-API errors because the latest version of a dependency may remove some APIs.
After further analysis, in this project,
The version constraint of dependency sacrebleu can be changed to >=1.1.0,<=1.1.1.
The version constraint of dependency sacrebleu can be changed to >=1.1.3,<=1.4.5.
The above suggestions can reduce dependency conflicts as much as possible while allowing the latest versions to be used without calling errors in the project.
The invocation of the current project includes all the following methods.
The calling methods from sacrebleu:
sacrebleu.corpus_bleu sacrebleu.compute_bleu
The calling methods from all methods:
get_hypo_and_ref numpy.array counts.append fairseq.utils.strip_pad max all_dataset_upsample_ratio.strip fairseq.data.PrependTokenDataset self.swap_sample FileNotFoundError self.tgt_dict.string model.encoder.forward.transpose torch.no_grad inspect.getfullargspec log.get float tqdm.tqdm hyps.append json.loads torch.cat f.read.split self.temperature.anchor_dot_contrast.torch.div.nn.LogSoftmax.diag cls.load_dictionary.index eval.readlines self.dataset.size self.temperature.anchor_dot_contrast.torch.div.nn.LogSoftmax.diag.sum isinstance torch.LongTensor _sentence_embedding self.inference_step fairseq.criterions.label_smoothed_cross_entropy.LabelSmoothedCrossEntropyCriterion.add_args open.close mask.float cls recover_bpe open hasattr id_num.score_dict.append self.dataset.prefetch fairseq.data.TruncateDataset eval.read numpy.array.sum src_list.append fairseq.models.transformer.transformer_wmt_en_de_big_t2t self.tgt_dict.pad eval toks.int src_datasets.append format bpe_symbol.line.replace.rstrip cls.load_dictionary.eos fairseq.data.data_utils.infer_language_pair torch.cat.contiguous logging.getLogger.info argparse.Namespace fairseq.models.register_model_architecture j.line.split similarity_function str Exception ValueError self.set_epoch open.write mask.float.sum.unsqueeze fairseq.data.AppendTokenDataset fairseq.data.encoders.build_tokenizer super.set_epoch super.__init__ cls.load_dictionary.unk size_ratio.dataset.len.np.ceil.astype super self.padding_idx.src_tokens.int.sum numpy.argsort super.reduce_metrics self.padding_idx.src_tokens.int itertools.count os.path.join self.padding_idx.target.int super.build_model generator.generate super.valid_step round int len fairseq.data.indexed_dataset.dataset_exists refs.append os.path.dirname torch.nn.LogSoftmax toks.int.cpu logging.getLogger re.compile mask.unsqueeze self.tokenizer.decode numpy.ceil remove_bpe_fn fairseq.tasks.register_task fairseq.tasks.translation.TranslationTask.add_args re.search.span torch.nn.CosineSimilarity self.dataset.num_tokens totals.append fairseq.utils.deprecation_warning self.compute_loss cls.load_dictionary self.target_dictionary.index prefix_tokens.to.to split_exists fairseq.utils.eval_bool remove_bpe torch.transpose self.len.np.random.permutation.astype getattr fairseq.tasks.translation.load_langpair_dataset torch.div re.search target.contiguous sum_logs fairseq.metrics.log_scalar self.padding_idx.target.int.sum contrast_feature.expand numpy.random.permutation tgt_list.append self.dataset.__getitem__ numpy.random.RandomState cls.load_dictionary.bos src_tokens.size numpy.random.RandomState.choice load_langpair_dataset bpe_symbol.line.replace.rstrip.replace fairseq.data.data_utils.load_indexed_dataset cls.load_dictionary.pad sacrebleu.compute_bleu fairseq.options.eval_bool mask.float.sum map self.get_contrastive_loss fairseq.data.StripTokenDataset self.build_generator fairseq.utils.split_paths fairseq.data.ConcatDataset fairseq.metrics.log_derived decode data.SubsampleLanguagePairDataset model join parser.add_argument id_num.hypothesis_dict.append tgt_datasets.append math.log fairseq.data.plasma_utils.PlasmaArray prefix_tokens.to.expand self.similarity_function all_dataset_upsample_ratio.strip.split fairseq.data.LanguagePairDataset id_num.pos_score_dict.append mask.unsqueeze.encoder_output.sum numpy.arange fairseq.utils.item o.write sacrebleu.corpus_bleu reprocess sample.size re.search.group fairseq.models.transformer.transformer_wmt_en_de fairseq.criterions.register_criterion self._inference_with_bleu mono_datas.append 
range anchor_feature.expand prefix_tokens.torch.LongTensor.unsqueeze sum model.encoder.forward
@developer
Could you please help me check this issue?
May I open a pull request to fix it?
Thank you very much.
Which fairseq version was used to save the 12e12d checkpoint?
Which fairseq version was used to save the 12e12d checkpoint?
Many thanks.
Hello, I would like to ask about a visualization question in the paper
Clarifications in training config
Hello
Thank you for your excellent work, and well-documented repo. I am trying to use your code to train a new model from scratch, and require some clarification on certain parts that are unclear to me, especially regarding the config.
(Please note I am referring to this config on the new_impl branch as an example of how I could create my own.)
- data (under meta): What does this refer to? Is this the directory that contains binarized versions of one multilingual parallel dataset made by concatenating datasets from several language pairs (e.g. en-es, en-fr, en-it), or does it contain language pair-specific binary files in its subdirectories?
- I can see that in load_config.sh, variables starting with meta_ are not written to the options variable, and both monolingual and parallel data are provided separately in train_multilingual_w_mono.sh. This seems to suggest that paths are expected in the form of data_1, data_2, etc. If so, could you please confirm what these paths refer to? I.e. how does data_1 differ from data_2?
- What is mono_dae? This is referred to repeatedly in the codebase, at various places. Would I need to set mono_key in the config file to mono_dae?
- Lastly, I have parallel and monolingual datasets that I have already preprocessed (with RAS substitution and language token prefixes). Would I need to set variables like langtoks, encoder_langtok and decoder_langtok?
Hope I can receive some assistance on this issue soon. Thanks!
pre-norm or post-norm
Hello! I see the paper mentions that you use pre-norm during training, but the released code settings still appear to be post-norm. Is the pre-norm setting configured somewhere else?
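Not one of the authors, but for context: in fairseq, pre-norm vs. post-norm is controlled by the encoder_normalize_before / decoder_normalize_before flags, and the transformer_wmt_en_de_big_t2t architecture (which the codebase references, see the method list above) enables them. A rough sketch of how such a pre-norm variant is registered follows; the architecture name is a placeholder and the settings are abridged, not this repo's exact configuration.

from fairseq.models import register_model_architecture
from fairseq.models.transformer import transformer_wmt_en_de_big

@register_model_architecture("transformer", "my_big_t2t_prenorm")
def my_big_t2t_prenorm(args):
    # pre-norm: LayerNorm is applied before each attention/FFN sub-layer instead of after
    args.encoder_normalize_before = getattr(args, "encoder_normalize_before", True)
    args.decoder_normalize_before = getattr(args, "decoder_normalize_before", True)
    transformer_wmt_en_de_big(args)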
what the "meta.ras_dict" in config is?
the last line of "meta" in the example config said that there should be a json file called "data/lang150/dicts/id_dict_1.json".
do we have this one in the repo?
or an example what it should contain?
Error downloading dataset
How can we fine-tune the model
It seems that there is no introduction on how to fine-tune the pre-trained model. Could you give me some instructions?
Question about the dataset
Congratulations on your interesting work!
Is the dataset introduced in the original implementation different from that in your new implementation?
BTW, is it possible for you to release the monolingual dataset and the various test sets in the tokenized raw-text format used in your original implementation?
Thanks!
FileNotFoundError: /home/pgye/mRASP2/mlnlc_mt
I ran train_w_mono.sh and it raised an error. Can you elaborate on how to use the code? Thank you for a job well done!