
muss's Introduction

Multilingual Unsupervised Sentence Simplification

Code and pretrained models to reproduce experiments in "MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases".

Prerequisites

Linux with Python 3.6 or above (not compatible with Python 3.9 yet). If your operating system is Windows, you can use WSL with Ubuntu 20.04 LTS.

Installing

git clone git@github.com:facebookresearch/muss.git
cd muss/
pip install -e .  # Install package
python -m spacy download pt_core_news_md en_core_web_md fr_core_news_md es_core_news_md # Install required spacy models
ulimit -n 100000 # If you train a new model

How to use

Some scripts might still contain a few bugs. If you notice anything wrong, feel free to open an issue or submit a pull request.

Simplify sentences from a file using pretrained models

First, download the pretrained model for the desired language into the folder resources/models. Pretrained models should be downloaded automatically, but you can also find them here:

muss_en_wikilarge_mined
muss_en_mined
muss_fr_mined
muss_es_mined
muss_pt_mined

Then run the command:

python scripts/simplify.py FILE_PATH_TO_SIMPLIFY --model-name MODEL_NAME

# English
python scripts/simplify.py scripts/examples.en --model-name muss_en_wikilarge_mined
# French
python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined
# Spanish
python scripts/simplify.py scripts/examples.es --model-name muss_es_mined
# Portuguese
python scripts/simplify.py scripts/examples.pt --model-name muss_pt_mined
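The simplification can also be driven from Python. Below is a minimal sketch, assuming the simplify_sentences helper that scripts/simplify.py itself calls (visible in the tracebacks quoted in the issues further down this page); model weights are downloaded on first use, as with the command-line script.

# Sketch: programmatic use of a pretrained simplifier.
# Assumes muss is installed (pip install -e .) and that
# muss.simplify.simplify_sentences takes a list of sentences and a model name.
from muss.simplify import simplify_sentences

sentences = [
    'This sentence is extremely complicated to understand.',
    'The mouse is eaten by the cat.',
]
for original, simple in zip(sentences, simplify_sentences(sentences, model_name='muss_en_wikilarge_mined')):
    print(f'{original}\n-> {simple}\n')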

Mine the data

If you are going to add a new language to this project, download the files for the target language from https://huggingface.co/edugp/kenlm/tree/main/wikipedia into the folder resources/models/language_models/wikipedia. These language models are used to keep only high-quality sentences during the paraphrase mining phase.

To run paraphrase mining, run the command below:

python scripts/mine_sequences.py
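For intuition about the filtering step: a KenLM model assigns each sentence a log-probability, and sentences whose length-normalised score is too low can be discarded as low quality. A rough sketch follows (the model filename and threshold are illustrative only, not the values used by mine_sequences.py):

# Sketch: filtering sentences with a KenLM Wikipedia language model.
# Model path and cutoff below are illustrative assumptions.
import kenlm

model = kenlm.Model('resources/models/language_models/wikipedia/en.arpa.bin')

def average_log_prob(sentence):
    # Length-normalised log10 probability of the sentence.
    return model.score(sentence, bos=True, eos=True) / max(1, len(sentence.split()))

candidates = ['A well formed English sentence about history.', 'asdf qwer zxcv uiop']
kept = [s for s in candidates if average_log_prob(s) > -4.0]  # arbitrary cutoff
print(kept)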

Train the model

python scripts/train_model.py NAME_OF_DATASET --language LANGUAGE
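scripts/train_model.py is a thin wrapper around the training helpers; one of the issues further down shows the underlying calls, so training can also be scripted roughly as follows (a sketch adapted from that issue, with a hypothetical dataset name; tune ngpus and max_tokens to your hardware):

# Sketch: programmatic training, mirroring the snippet quoted in the
# "Fine tune pre-trained muss" issue below. Dataset name is hypothetical.
from muss.fairseq.main import fairseq_train_and_evaluate_with_parametrization
from muss.mining.training import get_mbart_kwargs

kwargs = get_mbart_kwargs(dataset='my_mined_dataset', language='pt', use_access=False)
kwargs['train_kwargs']['ngpus'] = 1         # single local GPU
kwargs['train_kwargs']['max_tokens'] = 512  # lower this to avoid OOM
result = fairseq_train_and_evaluate_with_parametrization(**kwargs)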

Evaluate simplifications

Please head over to EASSE for Sentence Simplification evaluation.
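EASSE can also be called programmatically; the ASSET metrics issue below uses easse.cli.evaluate_system_output, roughly as in this sketch (file paths are placeholders):

# Sketch: scoring a system output with EASSE, as in the issue quoted below.
from easse.cli import evaluate_system_output

scores = evaluate_system_output(
    'custom',
    sys_sents_path='system_output.txt',
    orig_sents_path='resources/datasets/asset/test.complex',
    refs_sents_paths=','.join(f'resources/datasets/asset/test.simple.{i}' for i in range(10)),
    metrics=['bleu', 'fkgl', 'sari'],
    quality_estimation=False,
)
print(scores)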

License

The MUSS license is CC-BY-NC. See the LICENSE file for more details.

Authors

Citation

If you use MUSS in your research, please cite MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases

@article{martin2021muss,
  title={MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases},
  author={Martin, Louis and Fan, Angela and de la Clergerie, {\'E}ric and Bordes, Antoine and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:2005.00352},
  year={2021}
}

muss's People

Contributors

assisraphael, chirico85, lmvasque, louismartin, psawa


muss's Issues

AttributeError: module 'faiss' has no attribute 'METRIC_L2_DIST'

I'm getting the following error when I execute python scripts/mine_sequences.py .

Traceback (most recent call last):
  File "scripts/mine_sequences.py", line 118, in <module>
    train_sentences, get_index_name(), get_embeddings, faiss.METRIC_L2_DIST, base_index_dir
AttributeError: module 'faiss' has no attribute 'METRIC_L2_DIST'

I'm getting this error with both faiss-gpu and faiss-cpu. However, when I use faiss.METRIC_L2 (without DIST), it works fine. Any idea about the issue?
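For context, current faiss releases expose faiss.METRIC_L2 (squared L2 distance) rather than METRIC_L2_DIST, so renaming the constant in mine_sequences.py is the workaround described above. A small self-contained check (dimension and data are made up):

# Sketch: building and querying a flat faiss index with METRIC_L2.
import faiss
import numpy as np

d = 64                                             # embedding dimension (example)
embeddings = np.random.rand(1000, d).astype('float32')

index = faiss.index_factory(d, 'Flat', faiss.METRIC_L2)
index.add(embeddings)
distances, ids = index.search(embeddings[:3], 5)   # 5 nearest neighbours
print(ids)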

Cannot allocate memory

I am trying to reproduce the dataset generation, but when I run scripts/mine_sequences.py it always gets killed. I think it could be because of memory.
My hardware:
uname -a :

Linux 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

free -m:

              total        used        free      shared  buff/cache   available
Mem:         515755      169459      275909         857       70385      342168
Swap:         69368       51612       17756

I generate 10 ** 6 sentences, and the process gets killed while computing embeddings. I don't know how to fix this.

Metrics calculation for ASSET TEST Gold Reference

Hi,
May I know how the metrics for the ASSET TEST gold reference are calculated? I tried to do it in a leave-one-out scenario where each reference is evaluated against all the others, but the SARI score is slightly different from the result listed in the paper, while the FKGL is the same. It would be great to know what is wrong with my understanding. Following is my code:

complete_ref_path = ''
for i in range(10):
  complete_ref_path += '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.'+str(i)
  if i != 9:
    complete_ref_path += ','
complete_ref_path_list = complete_ref_path.split(',')
complete_ref_path_list

which gives

['/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.0',
 '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.1',
 '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.2',
 '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.3',
 '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.4',
 '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.5',
 '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.6',
 '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.7',
 '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.8',
 '/content/drive/MyDrive/muss/resources/datasets/asset/test.simple.9']
from easse.cli import evaluate_system_output

test_set = 'custom'

bleu_list,fkgl_list,sari_list, sari_add_list,sari_del_list,sari_keep_list= [],[],[],[],[],[]

for i in range(10):

  sys_sents_path = complete_ref_path_list[i]
  orig_sents_path = '/content/drive/MyDrive/muss/resources/datasets/asset/test.complex'
  refs_sents_paths = ','.join(complete_ref_path_list[:i]+complete_ref_path_list[i+1:])

  result = evaluate_system_output(
      test_set,
      sys_sents_path=sys_sents_path,
      orig_sents_path=orig_sents_path,
      refs_sents_paths=refs_sents_paths,
      metrics=['bleu','fkgl','sari', 'sari_by_operation'],
      quality_estimation=False,
  )

  bleu_list.append(result['bleu'])
  fkgl_list.append(result['fkgl'])
  sari_list.append(result['sari'])
  sari_add_list.append(result['sari_add'])
  sari_del_list.append(result['sari_del'])
  sari_keep_list.append(result['sari_keep'])

And the result is

np.mean(sari_list),get_mean_confidence_interval(sari_list)
(44.88725475437498, 0.32906356454497476)

np.mean(fkgl_list),get_mean_confidence_interval(fkgl_list)
(6.487483272312891, 0.15439073457017022)

Thank you in advance!

Preprocess the mined corpus with new control tokens

Hi,

I am exploring MUSS for sentence simplification, and I would like to add restrictions to the simplification style. Let's say that I'd like to ban the generation of sentences in complicated tenses, for example.

My idea is to preprocess the mined corpus with new control tokens that would fit my needs, in addition to the 4 control tokens presented in the paper.

Is this a technique that makes sense to you, and if so, is there a preferred way to re-annotate the corpus using the code in the repo?

Thanks,
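For readers with the same question: in ACCESS/MUSS the training corpus is re-annotated by prepending control tokens to each complex sentence before fine-tuning, so an extra constraint can in principle be added the same way. The sketch below only illustrates the idea with a made-up token name and feature; it is not the repo's actual preprocessor code or token format.

# Sketch: prepending one extra, hypothetical control token to each complex
# sentence of a mined corpus. Token name, feature and binning are assumptions.
def bin_ratio(value, step=0.05):
    # Round a ratio to a coarse bucket so the added vocabulary stays small.
    return round(round(value / step) * step, 2)

def tense_simplicity(complex_sent, simple_sent):
    # Placeholder feature standing in for whatever measures 'complicated tenses'.
    return min(1.0, len(simple_sent.split()) / max(1, len(complex_sent.split())))

def annotate(complex_sent, simple_sent):
    token = f'<TenseRatio_{bin_ratio(tense_simplicity(complex_sent, simple_sent))}>'
    return f'{token} {complex_sent}', simple_sent

print(annotate('The felines were being fed by the caretaker.', 'The caretaker fed the cats.'))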

There appear to be 1 leaked folder objects to clean up at shutdown

Hello,

I'm trying to replicate your work for the Portuguese language. I made some small adjustments to the code and I'm trying to run the sentence mining phase. However, I'm running into the following error:

raphael@instance-1:~/muss-ptBR$ cat nohup.out 
/home/raphael/.local/lib/python3.8/site-packages/submitit/core/core.py:718: RuntimeWarning: No submission happened during "with executor.batch()" context.
  warnings.warn(
Splitting CCNet shards into smaller subshards...
[]
0it [00:00, ?it/s]
Splitting CCNet shards into smaller subshards completed after 0.09s.
Tokenizing sentences...
[]
0it [00:00, ?it/s]/home/raphael/.local/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 6 leaked semlock objects to clean up at shutdown
  warnings.warn(
/home/raphael/.local/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 1 leaked folder objects to clean up at shutdown
  warnings.warn(
[1]+  Killed                  nohup python3 scripts/mine_sequences.py

At this point I'm just trying to understand how the code works. I'm running with only 1 shard mined from cc_net, on a machine with 1 Tesla T4 GPU, 8 vCPUs and 85 GB of RAM. As far as I understand, it is not a hardware problem (more than 50% of the memory was unused) and there are no problems in the process logs. I'm checking whether it's a problem with the Python version I'm using (3.8). Do you have any idea what this error could be?

Simplify in French

2021-11-02

Could you check the French simplification?
In your French example file, after running the simplification, the cat becomes a dog...

Thanks!

Can not get alignment to interpret the model behavior

Hi, I am currently trying to get the alignment output together with the source sentence and hypothesis. I want to print the source sentence, the alignment and the hypothesis, but my output looks like this:
[screenshot]

It is difficult for me to connect the scores with the original sentences; I just want the output to look like this example:
[screenshot]
Can you tell me how to solve my problem?

Generate multiple Output sentences using the simplify.py script.

Hello, I am trying to generate multiple simplifications using simplify.py. I understand that simplify.py uses the _fairseq_generate function, where you can specify num_hypothesis and nbest. I increased num_hypothesis to 12 and nbest to 5, but I am still getting a single simplification, whereas it should output multiple simplifications since nbest = 5.

Can you guide me on how to generate more simplifications? @louismartin

UnicodeEncodeError: 'ascii' codec can't encode character '\u2010' in position 48: ordinal not in range(128)

Hi @louismartin,

Thanks for publishing this work, really nice!

Regarding my inquiry, this is similar to the issue I've posted on easse. On this model, I'm also getting encoding errors when simplifying a list of sentences. I've added the same encoding='utf-8' and it stopped reporting this error.

Here are my diff files:

diff --git a/muss/preprocessors.py b/muss/preprocessors.py
index ee3dd86..6d438d5 100644
--- a/muss/preprocessors.py
+++ b/muss/preprocessors.py
@@ -131 +131 @@ class AbstractPreprocessor(ABC):
-        with open(output_filepath, 'w') as f:
+        with open(output_filepath, 'w', encoding='utf-8') as f:
@@ -139 +139 @@ class AbstractPreprocessor(ABC):
-        with open(output_filepath, 'w') as f:
+        with open(output_filepath, 'w', encoding='utf-8') as f:

diff --git a/muss/utils/helpers.py b/muss/utils/helpers.py
index 25210d8..78a5f41 100644
--- a/muss/utils/helpers.py
+++ b/muss/utils/helpers.py
@@ -91 +91 @@ def open_files(filepaths, mode='r'):
-        files = [Path(filepath).open(mode) for filepath in filepaths]
+        files = [Path(filepath).open(mode, encoding='utf-8') for filepath in filepaths]
@@ -137 +137 @@ def write_lines(lines, filepath=None):
-    with filepath.open('w') as f:
+    with filepath.open('w', encoding='utf-8') as f:
@@ -148 +148 @@ def yield_lines(filepath, gzipped=False, n_lines=None):
-    with open_function(filepath, 'rt') as f:
+    with open_function(filepath, 'rt', encoding='utf-8') as f:
@@ -325 +325 @@ def log_std_streams(filepath):
-    log_file = open(filepath, 'w')
+    log_file = open(filepath, 'w', encoding='utf-8')

diff --git a/scripts/simplify.py b/scripts/simplify.py
index 464129c..95a6231 100644
--- a/scripts/simplify.py
+++ b/scripts/simplify.py
@@ -26,0 +27,2 @@ if __name__ == '__main__':
+        s = s.encode('utf-8')
+        c = c.encode('utf-8')

I'd appreciate it if you could add these fixes :)

Thanks,

Laura

Inference Time

Hello, I am using MUSS in a Google Colab notebook. Earlier it was taking 3-4 minutes to simplify after downloading the model. I tried it 2 days ago and the time taken is still ~35 seconds. Is this the usual inference time, or is there something I am missing?

error: subprocess-exited-with-error, metadata-generation-failed

Could you please take a look at this error? Thanks in advance!

Using cached faiss-gpu-1.6.4.tar.gz (3.4 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [7 lines of output]
      running egg_info
      creating /private/var/folders/yr/s2kq90t55dvg8hlk3vxbd_t00000gn/T/pip-pip-egg-info-dn_pikjk/faiss_cpu.egg-info
      writing /private/var/folders/yr/s2kq90t55dvg8hlk3vxbd_t00000gn/T/pip-pip-egg-info-dn_pikjk/faiss_cpu.egg-info/PKG-INFO
      writing dependency_links to /private/var/folders/yr/s2kq90t55dvg8hlk3vxbd_t00000gn/T/pip-pip-egg-info-dn_pikjk/faiss_cpu.egg-info/dependency_links.txt
      writing top-level names to /private/var/folders/yr/s2kq90t55dvg8hlk3vxbd_t00000gn/T/pip-pip-egg-info-dn_pikjk/faiss_cpu.egg-info/top_level.txt
      writing manifest file '/private/var/folders/yr/s2kq90t55dvg8hlk3vxbd_t00000gn/T/pip-pip-egg-info-dn_pikjk/faiss_cpu.egg-info/SOURCES.txt'
      error: package directory '/private/var/folders/yr/s2kq90t55dvg8hlk3vxbd_t00000gn/T/pip-install-_im0yecm/faiss-gpu_5d68e653e8a043259ac9098ab18bb79c/faiss/faiss/python' does not exist
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

simplify.py: error: unrecognized arguments: Notebooks/muss/resources/models/muss_fr_mined/model.pt

Hello, when I run this command

!python '/content/drive/MyDrive/Colab Notebooks/muss/scripts/simplify.py' '/content/drive/MyDrive/Colab Notebooks/muss/scripts/examples.fr' --model-name muss_fr_mined

in Google Colab I get this error

INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt
Downloading...
... 100% - 6204 MB - 10.12 MB/s - 612s
Extracting...
usage: simplify.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED]
[--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16]
[--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile]
[--criterion {wav2vec,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,adaptive_loss,nat_loss,sentence_ranking,legacy_masked_lm_loss,composite_loss,cross_entropy,ctc,sentence_prediction,masked_lm,vocab_parallel_cross_entropy}]
[--tokenizer {nltk,moses,space}]
[--bpe {gpt2,subword_nmt,hf_byte_bpe,fastbpe,sentencepiece,characters,bert,bytes,byte_bpe}]
[--optimizer {adagrad,adafactor,sgd,lamb,adamax,adadelta,nag,adam}]
[--lr-scheduler {fixed,reduce_lr_on_plateau,triangular,cosine,tri_stage,polynomial_decay,inverse_sqrt}]
[--scoring {wer,chrf,sacrebleu,bleu}] [--task TASK]
[--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,no_c10d}]
[--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
[--find-unused-parameters] [--fast-stat-sync]
[--broadcast-buffers] [--distributed-wrapper {DDP,SlowMo}]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--path PATH]
[--remove-bpe [REMOVE_BPE]] [--quiet]
[--model-overrides MODEL_OVERRIDES]
[--results-path RESULTS_PATH] [--beam N] [--nbest N]
[--max-len-a N] [--max-len-b N] [--min-len N]
[--match-source-len] [--no-early-stop] [--unnormalized]
[--no-beamable-mm] [--lenpen LENPEN] [--unkpen UNKPEN]
[--replace-unk [REPLACE_UNK]] [--sacrebleu]
[--score-reference] [--prefix-size PS]
[--no-repeat-ngram-size N] [--sampling]
[--sampling-topk PS] [--sampling-topp PS]
[--constraints [{ordered,unordered}]] [--temperature N]
[--diverse-beam-groups N] [--diverse-beam-strength N]
[--diversity-rate N] [--print-alignment] [--print-step]
[--lm-path PATH] [--lm-weight N]
[--iter-decode-eos-penalty N] [--iter-decode-max-iter N]
[--iter-decode-force-max-iter] [--iter-decode-with-beam N]
[--iter-decode-with-external-reranker]
[--retain-iter-history] [--retain-dropout]
[--retain-dropout-modules RETAIN_DROPOUT_MODULES [RETAIN_DROPOUT_MODULES ...]]
[--decoding-format {unigram,ensemble,vote,dp,bs}]
[--force-anneal N] [--lr-shrink LS] [--warmup-updates N]
[-s SRC] [-t TARGET] [--load-alignments]
[--left-pad-source BOOL] [--left-pad-target BOOL]
[--max-source-positions N] [--max-target-positions N]
[--upsample-primary UPSAMPLE_PRIMARY] [--truncate-source]
[--num-batch-buckets N] [--eval-bleu]
[--eval-bleu-detok EVAL_BLEU_DETOK]
[--eval-bleu-detok-args JSON] [--eval-tokenized-bleu]
[--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]]
[--eval-bleu-args JSON] [--eval-bleu-print-samples] --langs
LANG [--prepend-bos]
data
simplify.py: error: unrecognized arguments: Notebooks/muss/resources/models/muss_fr_mined/model.pt

Could you provide spm_tokenizer and kenlm model

from functools import lru_cache

import kenlm
# Assumed import: SentencePieceBPETokenizer from the Hugging Face tokenizers package
from tokenizers import SentencePieceBPETokenizer


@lru_cache(maxsize=10)
def get_spm_tokenizer(model_dir):
    merges_file = model_dir / 'spm_tokenizer-merges.txt'
    vocab_file = model_dir / 'spm_tokenizer-vocab.json'
    return SentencePieceBPETokenizer(vocab_file=str(vocab_file), merges_file=str(merges_file))


@lru_cache(maxsize=10)
def get_kenlm_model(model_dir):
    model_file = model_dir / 'kenlm_model.arpa'
    return kenlm.Model(str(model_file))

I find that the dataset generation needs spm_tokenizer and kenlm_model.arpa; could you provide them?

get_easse_report_from_exp_dir Failed

Hello, I am running train_model.py but get_easse_report_from_exp_dir fails. This is the error:


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-15-5dc913ca0d14> in <module>()
----> 1 result = fairseq_train_and_evaluate_with_parametrization(**kwargs)

13 frames
/content/drive/MyDrive/muss/muss/fairseq/main.py in fairseq_train_and_evaluate_with_parametrization(dataset, **kwargs)
    228     kwargs['preprocessor_kwargs'] = recommended_preprocessors_kwargs
    229     # Evaluation
--> 230     scores = print_running_time(fairseq_evaluate_and_save)(exp_dir, **kwargs)
    231     score = combine_metrics(scores['bleu'], scores['sari'], scores['fkgl'], kwargs.get('metrics_coefs', [0, 1, 0]))
    232     # TODO: This is a redundant hack with what happens in fairseq_evaluate_and_save (predict_files and evaluate_kwargs), it should be fixed

/content/drive/MyDrive/muss/muss/utils/helpers.py in wrapped_func(*args, **kwargs)
    468         function_name = getattr(func, '__name__', repr(func))
    469         with log_action(function_name):
--> 470             return func(*args, **kwargs)
    471 
    472     return wrapped_func

/content/drive/MyDrive/muss/muss/fairseq/main.py in fairseq_evaluate_and_save(exp_dir, **kwargs)
    104     print(f'scores={scores}')
    105     report_path = exp_dir / 'easse_report.html'
--> 106     shutil.move(get_easse_report_from_exp_dir(exp_dir, **kwargs), report_path)
    107     print(f'report_path={report_path}')
    108     predict_files = kwargs.get(

/content/drive/MyDrive/muss/muss/fairseq/main.py in get_easse_report_from_exp_dir(exp_dir, **kwargs)
     97 def get_easse_report_from_exp_dir(exp_dir, **kwargs):
     98     simplifier = fairseq_get_simplifier(exp_dir, **kwargs)
---> 99     return get_easse_report(simplifier, **kwargs.get('evaluate_kwargs', {'test_set': 'asset_valid'}))
    100 
    101 

/content/drive/MyDrive/muss/muss/evaluation/general.py in get_easse_report(simplifier, test_set, orig_sents_path, refs_sents_paths)
     40         orig_sents_path=orig_sents_path,
     41         refs_sents_paths=refs_sents_paths,
---> 42         report_path=report_path,
     43     )
     44     return report_path

/usr/local/lib/python3.7/dist-packages/easse/cli.py in report(test_set, sys_sents_path, orig_sents_path, refs_sents_paths, report_path, tokenizer, lowercase, metrics)
    302         lowercase=lowercase,
    303         tokenizer=tokenizer,
--> 304         metrics=metrics,
    305     )
    306 

/usr/local/lib/python3.7/dist-packages/easse/report.py in write_html_report(filepath, *args, **kwargs)
    477 def write_html_report(filepath, *args, **kwargs):
    478     with open(filepath, 'w') as f:
--> 479         f.write(get_html_report(*args, **kwargs) + '\n')
    480 
    481 

/usr/local/lib/python3.7/dist-packages/easse/report.py in get_html_report(orig_sents, sys_sents, refs_sents, test_set, lowercase, tokenizer, metrics)
    471             doc.stag('hr')
    472             with doc.tag('div', klass='container-fluid'):
--> 473                 doc.asis(get_qualitative_examples_html(orig_sents, sys_sents, refs_sents))
    474     return indent(doc.getvalue())
    475 

/usr/local/lib/python3.7/dist-packages/easse/report.py in get_qualitative_examples_html(orig_sents, sys_sents, refs_sents)
    154             sample_generator = sorted(
    155                 zip(orig_sents, sys_sents, zip(*refs_sents)),
--> 156                 key=lambda args: sort_key(*args),
    157             )
    158             # Samples displayed by default

/usr/local/lib/python3.7/dist-packages/easse/report.py in <lambda>(args)
    154             sample_generator = sorted(
    155                 zip(orig_sents, sys_sents, zip(*refs_sents)),
--> 156                 key=lambda args: sort_key(*args),
    157             )
    158             # Samples displayed by default

/usr/local/lib/python3.7/dist-packages/easse/report.py in <lambda>(c, s, refs)
     91         (
     92             'Best simplifications according to SARI',
---> 93             lambda c, s, refs: -corpus_sari([c], [s], [refs]),
     94             lambda value: f'SARI={-value:.2f}',
     95         ),

/usr/local/lib/python3.7/dist-packages/easse/sari.py in corpus_sari(*args, **kwargs)
    264 
    265 def corpus_sari(*args, **kwargs):
--> 266     add_score, keep_score, del_score = get_corpus_sari_operation_scores(*args, **kwargs)
    267     return (add_score + keep_score + del_score) / 3

/usr/local/lib/python3.7/dist-packages/easse/sari.py in get_corpus_sari_operation_scores(orig_sents, sys_sents, refs_sents, lowercase, tokenizer, legacy, use_f1_for_deletion, use_paper_version)
    254     refs_sents = [[utils_prep.normalize(sent, lowercase, tokenizer) for sent in ref_sents] for ref_sents in refs_sents]
    255 
--> 256     stats = compute_ngram_stats(orig_sents, sys_sents, refs_sents)
    257 
    258     if not use_paper_version:

/usr/local/lib/python3.7/dist-packages/easse/sari.py in compute_ngram_stats(orig_sents, sys_sents, refs_sents)
    110     assert all(
    111         len(ref_sents) == len(orig_sents) for ref_sents in refs_sents
--> 112     ), "Reference sentences don't have the shape (n_references, n_samples)"
    113     add_sys_correct = [0] * NGRAM_ORDER
    114     add_sys_total = [0] * NGRAM_ORDER

AssertionError: Reference sentences don't have the shape (n_references, n_samples)


I printed out where the error occurs and it showed that

len(refs_sents)=1
len(ref_sents)=10
len(orig_sents)=1

which I suppose should be like this?

len(refs_sents)=10
len(ref_sents)=1
len(orig_sents)=1

I am not sure how to make this change without impacting the outcome of the code. I'd appreciate any advice. Thank you in advance!

Simplification not working with python 3.9

Hi,

In a Python 3.9 environment, I could not run the simplification script - I was getting this error:

python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined
usage: simplify.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format LOG_FORMAT]
                   [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16]
                   [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                   [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                   [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR]
                   [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                   [--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX]
                   [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
                   [--criterion {label_smoothed_cross_entropy,adaptive_loss,wav2vec,cross_entropy,sentence_ranking,label_smoothed_cross_entropy_with_alignment,composite_loss,sentence_prediction,ctc,nat_loss,masked_lm,legacy_masked_lm_loss,vocab_parallel_cross_entropy}]
                   [--tokenizer {nltk,space,moses}]
                   [--bpe {bert,byte_bpe,sentencepiece,bytes,subword_nmt,gpt2,fastbpe,hf_byte_bpe,characters}]
                   [--optimizer {adafactor,adamax,nag,sgd,lamb,adagrad,adadelta,adam}]
                   [--lr-scheduler {reduce_lr_on_plateau,inverse_sqrt,cosine,triangular,tri_stage,polynomial_decay,fixed}]
                   [--scoring {wer,sacrebleu,bleu,chrf}] [--task TASK] [--num-workers NUM_WORKERS]
                   [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                   [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                   [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl DATASET_IMPL]
                   [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET]
                   [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                   [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED]
                   [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID] [--batch-size-valid BATCH_SIZE_VALID]
                   [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
                   [--distributed-world-size DISTRIBUTED_WORLD_SIZE] [--distributed-rank DISTRIBUTED_RANK]
                   [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                   [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--local-rank LOCAL_RANK] [--distributed-no-spawn]
                   [--ddp-backend {c10d,no_c10d}] [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters]
                   [--fast-stat-sync] [--broadcast-buffers] [--distributed-wrapper {DDP,SlowMo}] [--slowmo-momentum SLOWMO_MOMENTUM]
                   [--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE]
                   [--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES]
                   [--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                   [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                   [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}]
                   [--zero-sharding {none,os}] [--path PATH] [--remove-bpe [REMOVE_BPE]] [--quiet] [--model-overrides MODEL_OVERRIDES]
                   [--results-path RESULTS_PATH] [--beam N] [--nbest N] [--max-len-a N] [--max-len-b N] [--min-len N]
                   [--match-source-len] [--no-early-stop] [--unnormalized] [--no-beamable-mm] [--lenpen LENPEN] [--unkpen UNKPEN]
                   [--replace-unk [REPLACE_UNK]] [--sacrebleu] [--score-reference] [--prefix-size PS] [--no-repeat-ngram-size N]
                   [--sampling] [--sampling-topk PS] [--sampling-topp PS] [--constraints [{ordered,unordered}]] [--temperature N]
                   [--diverse-beam-groups N] [--diverse-beam-strength N] [--diversity-rate N] [--print-alignment] [--print-step]
                   [--lm-path PATH] [--lm-weight N] [--iter-decode-eos-penalty N] [--iter-decode-max-iter N]
                   [--iter-decode-force-max-iter] [--iter-decode-with-beam N] [--iter-decode-with-external-reranker]
                   [--retain-iter-history] [--retain-dropout]
                   [--retain-dropout-modules RETAIN_DROPOUT_MODULES [RETAIN_DROPOUT_MODULES ...]]
                   [--decoding-format {unigram,ensemble,vote,dp,bs}]
simplify.py: error: argument --dataset-impl: invalid typing.Optional[fairseq.dataclass.utils.Choices] value: 'raw' 

Downgrading to Python 3.7 and re-running the setup fixed it. I'm reporting this since 20848f6 was intended to fix compatibility with 3.9, but it seems not to be working.

Anyhow, thanks for this great piece!

Need help to get better performances

Hello,

This is the first time I am using this project. I tried an example from the README, but I have a question about the execution speed.
The execution time is about 40 to 60 seconds.
I have put timers in the muss code to find which part of the code spends this time. It seems to be the call to generate.cli_main() at line 188 of the muss/fairseq/base.py file.

Could you tell me whether this duration is standard or whether I should expect better performance?
Is there something I can do to speed this up?

The example I have tried is:
time python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined

To test muss, I have created a docker image that is deployed on a GPU node (T1-45 from OVH) of a Kubernetes cluster.
The code is available here: muss-docker-debug and the image is pushed here: https://hub.docker.com/r/cleyrop/muss-debug.
The GPU node characteristics are:

45 GB RAM
8 vCores (2.1 GHz)
400 GB SSD
2,000 Mbit/s
Tesla V100 

Here are the traces from the execution:

~/muss$ time python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined
  0%|                                                                                                                                                                                                                                                     | 0/1 [00:00<?, ?it/s]/home/muss/.local/lib/python3.7/site-packages/fairseq/search.py:140: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  beams_buf = indices_buf // vocab_size
/home/muss/.local/lib/python3.7/site-packages/fairseq/sequence_generator.py:651: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  unfin_idx = idx // beam_size
--------------------------------------------------------------------------------                                                                                                                                                                                                
Original:   Cette phrase est extrêmement compliquée à comprendre.
Simplified: Cette phrase est très difficile à comprendre.
--------------------------------------------------------------------------------
Original:   La souris est mangée par le chat.
Simplified: La souris est mangée par le chien.
--------------------------------------------------------------------------------
Original:   Facile à lire et à comprendre (FALC) désigne un ensemble de règles ayant pour finalité de rendre l'information facile à lire et à comprendre.
Simplified: Facile à lire et à comprendre (FALC) est un ensemble de règles visant à rendre l'information facile à comprendre et à lire.
--------------------------------------------------------------------------------
Original:   L'altruisme efficace vise à adopter une démarche analytique afin d’identifier les meilleurs moyens d’avoir un impact positif sur le monde.
Simplified: L'altruisme efficace est une démarche analytique visant à identifier les meilleurs moyens d'avoir un impact positif sur le monde.

real   0m48.031s
user   0m22.117s
sys    0m25.155s

Training error: hydra error. Parameter lr_scheduler.total_num_update=null

Hi,
I'm trying to train muss, but got a Hydra error:

fairseq_prepare_and_train...
exp_dir=/scratch1/fer201/muss/muss-git/experiments/fairseq/local_1634083552326
fairseq-train /scratch1/fer201/muss/muss-git/resources/datasets/_9585ac127caca9d7160a28f1d8180050/fairseq_preprocessed_complex-simple --task translation --source-lang complex --target-lang simple --save-dir /scratch1/fer201/muss/muss-git/experiments/fairseq/local_1634083552326/checkpoints --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 500 --update-freq 128 --arch bart_large --dropout 0.1 --weight-decay 0.0 --clip-norm 0.1 --share-all-embeddings --no-epoch-checkpoints --save-interval 999999 --validate-interval 999999 --max-update 20000 --save-interval-updates 100 --keep-interval-updates 1 --patience 10 --batch-size 64 --seed 917 --distributed-world-size 1 --distributed-port 15798 --fp16 --restore-file /scratch1/fer201/muss/muss-git/resources/models/bart.large/model.pt --max-tokens 512 --truncate-source --layernorm-embedding --share-all-embeddings --share-decoder-input-output-embed --reset-optimizer --reset-dataloader --reset-meters --required-batch-size-multiple 1 --label-smoothing 0.1 --attention-dropout 0.1 --weight-decay 0.01 --optimizer 'adam' --adam-betas '(0.9, 0.999)' --adam-eps 1e-08 --clip-norm 0.1 --skip-invalid-size-inputs-valid-test --find-unused-parameters
fairseq_prepare_and_train failed after 4.45s.
Traceback (most recent call last):
File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 513, in _apply_overrides_to_config
OmegaConf.update(cfg, key, value, merge=True)
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/omegaconf.py", line 613, in update
root.setattr(last_key, value)
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 285, in setattr
raise e
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 282, in setattr
self.__set_impl(key, value)
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 266, in __set_impl
self._set_item_impl(key, value)
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/basecontainer.py", line 398, in _set_item_impl
self._validate_set(key, value)
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 143, in _validate_set
self._validate_set_merge_impl(key, value, is_assign=True)
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 156, in _validate_set_merge_impl
self._format_and_raise(
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/base.py", line 95, in _format_and_raise
format_and_raise(
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/_utils.py", line 694, in format_and_raise
_raise(ex, cause)
File "/scratch1/fer201/muss/lib/python3.9/site-packages/omegaconf/_utils.py", line 610, in _raise
raise ex # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ValidationError: child 'lr_scheduler.total_num_update' is not Optional
full_key: lr_scheduler.total_num_update
reference_type=Optional[PolynomialDecayLRScheduleConfig]
object_type=PolynomialDecayLRScheduleConfig

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/scratch1/fer201/muss/muss-git/scripts/train_model.py", line 21, in
result = fairseq_train_and_evaluate_with_parametrization(**kwargs)
File "/scratch1/fer201/muss/muss-git/muss/fairseq/main.py", line 224, in fairseq_train_and_evaluate_with_parametrization
exp_dir = print_running_time(fairseq_prepare_and_train)(dataset, **kwargs)
File "/scratch1/fer201/muss/muss-git/muss/utils/helpers.py", line 470, in wrapped_func
return func(*args, **kwargs)
File "/scratch1/fer201/muss/muss-git/muss/fairseq/main.py", line 74, in fairseq_prepare_and_train
fairseq_train(preprocessed_dir, exp_dir=exp_dir, **train_kwargs)
File "/scratch1/fer201/muss/muss-git/muss/utils/training.py", line 60, in wrapped_func
return func(*args, **kwargs)
File "/scratch1/fer201/muss/muss-git/muss/fairseq/base.py", line 127, in fairseq_train
train.cli_main()
File "/scratch1/fer201/muss/fairseq-git/fairseq_cli/train.py", line 496, in cli_main
cfg = convert_namespace_to_omegaconf(args)
File "/scratch1/fer201/muss/fairseq-git/fairseq/dataclass/utils.py", line 389, in convert_namespace_to_omegaconf
composed_cfg = compose("config", overrides=overrides, strict=False)
File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/experimental/compose.py", line 31, in compose
cfg = gh.hydra.compose_config(
File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 507, in compose_config
cfg = self.config_loader.load_configuration(
File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 151, in load_configuration
return self._load_configuration(
File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 277, in _load_configuration
ConfigLoaderImpl._apply_overrides_to_config(config_overrides, cfg)
File "/scratch1/fer201/muss/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 520, in _apply_overrides_to_config
raise ConfigCompositionException(
hydra.errors.ConfigCompositionException: Error merging override lr_scheduler.total_num_update=null

[Help wanted] Fine tune pre-trained muss(mbart-large-cc25) with mined paraphrases (si_LK)

For my final year research project, I'm using this approach as my baseline and trying to do text simplification for the Sinhala language (a native language of Sri Lanka). I don't have enough infrastructure to run the cc_net pipeline to crawl language data, but I used several already existing sources (15.7 million plain sentences) and mined them to get paraphrases. Now, as the next stage, I'm trying to fine-tune mBART using the mined paraphrases. I went through train_model.py and train_paper_models.py and implemented a similar script for Sinhala by changing the related methods.

from muss.fairseq.main import fairseq_train_and_evaluate_with_parametrization
from muss.mining.training import get_mbart_kwargs


sin15M = 'sin15M'
kwargs = get_mbart_kwargs(dataset=sin15M, language='si', use_access=False)
kwargs['train_kwargs']['ngpus'] = 1  # Set this from 8 to 1 for local training
kwargs['train_kwargs']['max_tokens'] = 512  # Lower this number to prevent OOM
result = fairseq_train_and_evaluate_with_parametrization(**kwargs)

I tried to run this on a small sample of 6000 training sentences and 750 test and valid sentences, on an NVIDIA Tesla T4 with 16 GB of GPU memory. I get OOM issues when I try kwargs['train_kwargs']['max_tokens'] with the values 512, 256, 64, 32, 16 and 8. With 4 it says no dataset was found. Any idea what could be going wrong here?

PS - I did not change anything internally related to the models. Only changed the helper methods to get datasets.

train model failed again

Using the latest code
The error is below:

Traceback (most recent call last):
File "scripts/train_models.py", line 82, in
[job.result() for jobs in jobs_dict.values() for job in jobs]
File "scripts/train_models.py", line 82, in
[job.result() for jobs in jobs_dict.values() for job in jobs]
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/submitit/core/core.py", line 261, in result
r = self.results()
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/submitit/local/debug.py", line 72, in results
return [self._submission.result()]
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/submitit/core/utils.py", line 128, in result
self._result = self.function(*self.args, **self.kwargs)
File "/home/liuyijiao/muss/muss/utils/training.py", line 19, in wrapped_func
return func(*args, **kwargs)
File "/home/liuyijiao/muss/muss/utils/training.py", line 39, in wrapped_func
return func(*args, **kwargs)
File "/home/liuyijiao/muss/muss/utils/submitit.py", line 41, in wrapped_func
return func(*args, **kwargs)
File "/home/liuyijiao/muss/muss/utils/training.py", line 49, in wrapped_func
result = func(*args, **kwargs)
File "/home/liuyijiao/muss/muss/utils/helpers.py", line 470, in wrapped_func
return func(*args, **kwargs)
File "/home/liuyijiao/muss/muss/fairseq/main.py", line 228, in fairseq_train_and_evaluate_with_parametrization
recommended_preprocessors_kwargs = print_running_time(find_best_parametrization)(exp_dir, **kwargs)
File "/home/liuyijiao/muss/muss/utils/helpers.py", line 470, in wrapped_func
return func(*args, **kwargs)
File "/home/liuyijiao/muss/muss/fairseq/main.py", line 174, in find_best_parametrization
return find_best_parametrization_nevergrad(exp_dir, preprocessors_kwargs, *args, **kwargs)
File "/home/liuyijiao/muss/muss/fairseq/main.py", line 150, in find_best_parametrization_nevergrad
recommendation = optimizer.minimize(evaluate_parametrization, verbosity=0)
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/nevergrad/optimization/base.py", line 460, in minimize
result = job.result()
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/nevergrad/optimization/utils.py", line 133, in result
self._result = self.func(*self.args, **self.kwargs)
File "/home/liuyijiao/muss/muss/fairseq/main.py", line 130, in evaluate_parametrization
scores = evaluate_simplifier(simplifier, **kwargs.get('evaluate_kwargs', {'test_set': 'asset_valid'}))
File "/home/liuyijiao/muss/muss/evaluation/general.py", line 20, in evaluate_simplifier
sys_sents_path = simplifier(orig_sents_path)
File "/home/liuyijiao/muss/muss/simplifiers.py", line 42, in wrapped
simplifier(complex_filepath, pred_filepath)
File "/home/liuyijiao/muss/muss/simplifiers.py", line 30, in wrapped
simplifier(complex_filepath, pred_filepath)
File "/home/liuyijiao/muss/muss/simplifiers.py", line 68, in preprocessed_simplifier
preprocessed_pred_filepath = simplifier(preprocessed_complex_filepath)
File "/home/liuyijiao/muss/muss/simplifiers.py", line 42, in wrapped
simplifier(complex_filepath, pred_filepath)
File "/home/liuyijiao/muss/muss/simplifiers.py", line 30, in wrapped
simplifier(complex_filepath, pred_filepath)
File "/home/liuyijiao/muss/muss/simplifiers.py", line 54, in fairseq_simplifier
fairseq_generate(complex_filepath, output_pred_filepath, exp_dir, **kwargs)
File "/home/liuyijiao/muss/muss/fairseq/base.py", line 278, in fairseq_generate
**kwargs,
File "/home/liuyijiao/muss/muss/utils/training.py", line 60, in wrapped_func
return func(*args, **kwargs)
File "/home/liuyijiao/muss/muss/fairseq/base.py", line 231, in _fairseq_generate
generate.cli_main()
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/fairseq_cli/generate.py", line 382, in cli_main
main(args)
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/fairseq_cli/generate.py", line 41, in main
return _main(args, sys.stdout)
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/fairseq_cli/generate.py", line 179, in _main
for sample in progress:
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/tqdm/std.py", line 1127, in iter
for obj in iterable:
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/fairseq/data/iterators.py", line 59, in iter
for x in self.iterable:
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/fairseq/data/iterators.py", line 591, in next
raise item
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/fairseq/data/iterators.py", line 522, in run
for item in self._source:
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in next
data = self._next_data()
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
idx, data = self._get_data()
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1034, in _get_data
success, data = self._try_get_data()
File "/home/liuyijiao/torch_venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 905, in _try_get_data
" at the beginning of your code") from None
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using ulimit -n in the shell or change the sharing strategy by calling torch.multiprocessing.set_sharing_strategy('file_system') at the beginning of your code
36%|█████████████████▎ | 46/128 [59:14:20<105:36:00, 4636.10s/it]

By the way, it happens in find_best_parametrization. But when I change max_update from 50000 to 50 in fairseq/base.py, it seems fine. I checked the ulimit on my machine:

$ ulimit -Sn
1024
$ ulimit -Hn
1048576

By the way, the max-sentences error also happens in find_best_parametrization when using get_mbart_kwargs.

Download Problem

Hello,
when I run this command
cd muss/
pip install -e .

my command log stops here.
Please help me.
[error screenshot]

train model failed

I changed cluster "local" to "debug" in scripts/train_model.py and ran the command "python3 scripts/train_models.py", but it fails.
The error:

fairseq-train /home/liuyijiao/muss/resources/datasets/_d41b33752d58c3fa688aef596b98df2b/fairseq_preprocessed_complex-simple --task translation --source-lang complex --target-lang simple --save-dir /home/liuyijiao/muss/experiments/fairseq/slurmjob_DEBUG_139908269653632/checkpoints --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --update-freq 16 --arch mbart_large --dropout 0.3 --weight-decay 0.0 --clip-norm 0.1 --share-all-embeddings --no-epoch-checkpoints --save-interval 999999 --validate-interval 999999 --max-update 50000 --save-interval-updates 100 --keep-interval-updates 1 --patience 10 --max-sentences 64 --seed 708 --distributed-world-size 8 --distributed-port 11733 --fp16 --restore-file '/home/liuyijiao/muss/resources/models/mbart/model.pt' --task 'translation_from_pretrained_bart' --source-lang 'complex' --target-lang 'simple' --encoder-normalize-before --decoder-normalize-before --label-smoothing 0.2 --dataset-impl 'mmap' --optimizer 'adam' --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --min-lr -1 --total-num-update 40000 --attention-dropout 0.1 --weight-decay 0.0 --max-tokens 1024 --update-freq 2 --log-format 'simple' --log-interval 2 --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler --langs 'ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN' --layernorm-embedding --ddp-backend 'no_c10d'
usage: train_models.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
[--log-format {json,none,simple,tqdm}]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED]
[--cpu] [--tpu] [--bf16] [--memory-efficient-bf16]
[--fp16] [--memory-efficient-fp16]
[--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW]
[--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE]
[--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ]
[--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE]
[--checkpoint-suffix CHECKPOINT_SUFFIX]
[--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH]
[--profile]
[--criterion {sentence_ranking,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,sentence_prediction,cross_entropy,ctc,legacy_masked_lm_loss,masked_lm,adaptive_loss,nat_loss,composite_loss,wav2vec,vocab_parallel_cross_entropy}]
[--tokenizer {nltk,moses,space}]
[--bpe {byte_bpe,subword_nmt,sentencepiece,gpt2,characters,bert,hf_byte_bpe,bytes,fastbpe}]
[--optimizer {sgd,adagrad,nag,adadelta,lamb,adafactor,adamax,adam}]
[--lr-scheduler {inverse_sqrt,tri_stage,reduce_lr_on_plateau,triangular,polynomial_decay,cosine,fixed}]
[--scoring {sacrebleu,bleu,wer,chrf}] [--task TASK]
[--num-workers NUM_WORKERS]
[--skip-invalid-size-inputs-valid-test]
[--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
[--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}]
[--data-buffer-size DATA_BUFFER_SIZE]
[--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET]
[--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
[--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED]
[--disable-validation]
[--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID]
[--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID]
[--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT]
[--device-id DEVICE_ID] [--distributed-no-spawn]
[--ddp-backend {c10d,no_c10d}]
[--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus]
[--find-unused-parameters] [--fast-stat-sync]
[--broadcast-buffers]
[--distributed-wrapper {DDP,SlowMo}]
[--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM]
[--localsgd-frequency LOCALSGD_FREQUENCY]
[--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel]
[--pipeline-balance PIPELINE_BALANCE]
[--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS]
[--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
[--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
[--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--arch ARCH]
[--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
[--stop-time-hours STOP_TIME_HOURS]
[--clip-norm CLIP_NORM] [--sentence-avg]
[--update-freq UPDATE_FREQ] [--lr LR] [--min-lr MIN_LR]
[--use-bmuf] [--save-dir SAVE_DIR]
[--restore-file RESTORE_FILE]
[--finetune-from-model FINETUNE_FROM_MODEL]
[--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer]
[--optimizer-overrides OPTIMIZER_OVERRIDES]
[--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES]
[--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-last-epochs KEEP_LAST_EPOCHS]
[--keep-best-checkpoints KEEP_BEST_CHECKPOINTS]
[--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints] [--no-save-optimizer-state]
[--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
[--maximize-best-checkpoint-metric]
[--patience PATIENCE]
[--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
[--dropout D] [--attention-dropout D]
[--activation-dropout D] [--encoder-embed-path STR]
[--encoder-embed-dim N] [--encoder-ffn-embed-dim N]
[--encoder-layers N] [--encoder-attention-heads N]
[--encoder-normalize-before] [--encoder-learned-pos]
[--decoder-embed-path STR] [--decoder-embed-dim N]
[--decoder-ffn-embed-dim N] [--decoder-layers N]
[--decoder-attention-heads N] [--decoder-learned-pos]
[--decoder-normalize-before] [--decoder-output-dim N]
[--share-decoder-input-output-embed]
[--share-all-embeddings]
[--no-token-positional-embeddings]
[--adaptive-softmax-cutoff EXPR]
[--adaptive-softmax-dropout D] [--layernorm-embedding]
[--no-scale-embedding] [--no-cross-attention]
[--cross-self-attention] [--encoder-layerdrop D]
[--decoder-layerdrop D]
[--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
[--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
[--quant-noise-pq D] [--quant-noise-pq-block-size D]
[--quant-noise-scalar D] [--pooler-dropout D]
[--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
[--spectral-norm-classification-head]
[--label-smoothing D] [--report-accuracy]
[--ignore-prefix-size IGNORE_PREFIX_SIZE]
[--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
[--weight-decay WEIGHT_DECAY] [--use-old-adam]
[--force-anneal N] [--warmup-updates N]
[--end-learning-rate END_LEARNING_RATE] [--power POWER]
[--total-num-update TOTAL_NUM_UPDATE] [-s SRC]
[-t TARGET] [--load-alignments]
[--left-pad-source BOOL] [--left-pad-target BOOL]
[--max-source-positions N] [--max-target-positions N]
[--upsample-primary UPSAMPLE_PRIMARY]
[--truncate-source] [--num-batch-buckets N]
[--eval-bleu] [--eval-bleu-detok EVAL_BLEU_DETOK]
[--eval-bleu-detok-args JSON] [--eval-tokenized-bleu]
[--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]]
[--eval-bleu-args JSON] [--eval-bleu-print-samples]
--langs LANG [--prepend-bos]
data
train_models.py: error: unrecognized arguments: --max-sentences 64
fairseq_prepare_and_train failed after 0.87s.
fairseq_train_and_evaluate_with_parametrization failed after 0.87s.

The code:

from collections import defaultdict
from tqdm import tqdm

# get_executor, fairseq_train_and_evaluate_with_parametrization and the print_*
# decorators are muss-internal helpers whose imports were omitted in the original
# snippet; kwargs_dict is built earlier in the script.
jobs_dict = defaultdict(list)  # assumed here: one list of submitted jobs per experiment

for exp_name, kwargs in tqdm(kwargs_dict.items()):
    executor = get_executor(
        cluster='debug',
        slurm_partition='priority',
        submit_decorators=[print_function_name, print_args, print_job_id, print_result, print_running_time],
        timeout_min=2 * 24 * 60,
        slurm_comment='EMNLP Arxiv deadline May 1st',
        gpus_per_node=kwargs['train_kwargs']['ngpus'],
        nodes=1,
        slurm_constraint='volta32gb',
        name=exp_name,
    )
    for i in range(5):
        job = executor.submit(fairseq_train_and_evaluate_with_parametrization, **kwargs)
        jobs_dict[exp_name].append(job)
[job.result() for jobs in jobs_dict.values() for job in jobs]

When cluster is "local", training fails too.
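A possible explanation (not confirmed in this thread) is a fairseq version mismatch: more recent fairseq releases dropped the --max-sentences flag in favour of --batch-size, so a newer fairseq than the one the repo was written against would reject it. A hypothetical sketch of renaming the offending key before jobs are submitted, assuming the value is carried in kwargs['train_kwargs'] under 'max_sentences' (neither assumption is verified against the muss code):

# Hypothetical workaround sketch (assumptions: the installed fairseq no longer
# accepts --max-sentences, and the batch size is stored in train_kwargs under
# the key 'max_sentences'; neither is verified here).
for exp_name, kwargs in kwargs_dict.items():
    train_kwargs = kwargs['train_kwargs']
    if 'max_sentences' in train_kwargs:
        train_kwargs['batch_size'] = train_kwargs.pop('max_sentences')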

Hello, when I run the pretrained model I get this error

] [-t TARGET] [--load-alignments]
[--left-pad-source BOOL] [--left-pad-target BOOL]
[--max-source-positions N] [--max-target-positions N]
[--upsample-primary UPSAMPLE_PRIMARY] [--truncate-source]
[--num-batch-buckets N] [--eval-bleu]
[--eval-bleu-detok EVAL_BLEU_DETOK]
[--eval-bleu-detok-args JSON] [--eval-tokenized-bleu]
[--eval-bleu-remove-bpe [EVAL_BLEU_REMOVE_BPE]]
[--eval-bleu-args JSON] [--eval-bleu-print-samples] --langs
LANG [--prepend-bos]
data
simplify.py: error: unrecognized arguments: Notebooks/muss/resources/models/muss_es_mined/model.pt
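For what it's worth, an argparse error whose "unrecognized arguments" are just a model path often means the path contained an unescaped space (for example a parent directory with a space in its name), so the shell split the --path value into two arguments when the fairseq-generate command was run. A small hedged sketch of quoting a path before interpolating it into a command string; the path below is purely illustrative, and where muss actually assembles the command is an assumption:

import shlex

model_path = "/content/drive/My Folder/muss/resources/models/muss_es_mined/model.pt"  # illustrative only
# shlex.quote keeps an embedded space from splitting the value in two.
path_arg = f"--path {shlex.quote(model_path)}"
print(path_arg)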

Unable to process large dataset

I'm trying to run the simplify script on the WMT14 dataset. After two days of processing, the script failed with the following error. Any fixes?

I do not get any errors when running on a smaller dataset.

Traceback (most recent call last):
  File "scripts/simplify.py", line 28, in <module>
    pred_sentences = simplify_sentences(source_sentences, model_name=args.model_name)
  File ".../muss/simplify.py", line 76, in simplify_sentences
    pred_path = simplifier(source_path)
  File ".../muss/simplifiers.py", line 42, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File ".../muss/simplifiers.py", line 30, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File ".../muss/simplifiers.py", line 68, in preprocessed_simplifier
    preprocessed_pred_filepath = simplifier(preprocessed_complex_filepath)
  File ".../muss/simplifiers.py", line 42, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File ".../muss/simplifiers.py", line 30, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File ".../muss/simplifiers.py", line 54, in fairseq_simplifier
    fairseq_generate(complex_filepath, output_pred_filepath, exp_dir, **kwargs)
  File ".../muss/fairseq/base.py", line 235, in fairseq_generate
    **kwargs,
  File ".../muss/utils/training.py", line 60, in wrapped_func
    return func(*args, **kwargs)
  File ".../muss/fairseq/base.py", line 191, in _fairseq_generate
    predictions = [hypotheses[hypothesis_num - 1] for hypotheses in all_hypotheses]
  File ".../muss/fairseq/base.py", line 191, in <listcomp>
    predictions = [hypotheses[hypothesis_num - 1] for hypotheses in all_hypotheses]
IndexError: list index out of range
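The failing line assumes every input sentence received at least one hypothesis; if generation skips or drops a sentence, its hypothesis list is empty and the indexing fails. A minimal defensive sketch (a local patch idea, not the repository's fix) that falls back to an empty prediction instead of raising:

# Hypothetical replacement for the list comprehension around
# muss/fairseq/base.py line 191: keep the n-th hypothesis when it exists,
# otherwise emit an empty string so the other sentences stay aligned.
predictions = [
    hypotheses[hypothesis_num - 1] if len(hypotheses) >= hypothesis_num else ""
    for hypotheses in all_hypotheses
]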

Vectorizing new control tokens

This question touches on this issue. I have looked through the source code provided in that issue, but I still don't understand how the model's vocabulary is updated when new control tokens are added. Could you please provide a more detailed explanation? It is also unclear how many unique vocabulary entries are added per control token. I know that in ACCESS, according to the paper, 40 unique values per control token are added, but what if a new token needs far fewer values?
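If the new control-token values end up as dedicated dictionary entries rather than being split into existing subwords, the pretrained embedding matrix has to be grown to match the larger vocabulary. A generic PyTorch sketch of that operation (not the repository's actual code; initialising new rows from the mean of the existing rows is just a common heuristic):

import torch
import torch.nn as nn

def extend_embedding(old_emb: nn.Embedding, num_new_tokens: int) -> nn.Embedding:
    # Append rows for the new vocabulary entries while keeping the
    # pretrained rows untouched.
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + num_new_tokens, dim, padding_idx=old_emb.padding_idx)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight
        new_emb.weight[old_vocab:] = old_emb.weight.mean(dim=0)
    return new_emb

In this view, the number of added entries is simply the number of control tokens times the number of discrete values you decide to use for each, so a token that only needs a handful of values only adds that many rows.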

Error(s) in loading state_dict for BARTModel

Hello,
Recently I managed to train the model for Portuguese, but I had problems with the size of the model dictionaries. After training, I copied model.pt, dict.simple.txt and dict.complex.txt from the folder '/muss-ptBR/experiments/submitit_logs/4085/', together with the file '/resources/datasets/_c234210b2ade71918a75f34527e40d50/sentencepiece.bpe.model', into the folder 'muss-ptBR/resources/models/muss_pt_mined'. However, when running the simplification script (scripts/simplify.py) I get the following error:

Traceback (most recent call last):
  File "scripts/simplify.py", line 25, in <module>
    pred_sentences = simplify_sentences(source_sentences, model_name=args.model_name)
  File "/home/raphael/muss-ptBR/muss/simplify.py", line 85, in simplify_sentences
    pred_path = simplifier(source_path)
  File "/home/raphael/muss-ptBR/muss/simplifiers.py", line 42, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File "/home/raphael/muss-ptBR/muss/simplifiers.py", line 30, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File "/home/raphael/muss-ptBR/muss/simplifiers.py", line 68, in preprocessed_simplifier
    preprocessed_pred_filepath = simplifier(preprocessed_complex_filepath)
  File "/home/raphael/muss-ptBR/muss/simplifiers.py", line 42, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File "/home/raphael/muss-ptBR/muss/simplifiers.py", line 30, in wrapped
    simplifier(complex_filepath, pred_filepath)
  File "/home/raphael/muss-ptBR/muss/simplifiers.py", line 54, in fairseq_simplifier
    fairseq_generate(complex_filepath, output_pred_filepath, exp_dir, **kwargs)
  File "/home/raphael/muss-ptBR/muss/fairseq/base.py", line 222, in fairseq_generate
    _fairseq_generate(
  File "/home/raphael/muss-ptBR/muss/utils/training.py", line 60, in wrapped_func
    return func(*args, **kwargs)
  File "/home/raphael/muss-ptBR/muss/fairseq/base.py", line 188, in _fairseq_generate
    generate.cli_main()
  File "/home/raphael/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 379, in cli_main
    main(args)
  File "/home/raphael/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 41, in main
    return _main(args, sys.stdout)
  File "/home/raphael/.local/lib/python3.8/site-packages/fairseq_cli/generate.py", line 88, in _main
    models, _model_args = checkpoint_utils.load_model_ensemble(
  File "/home/raphael/.local/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 250, in load_model_ensemble
    ensemble, args, _task = load_model_ensemble_and_task(
  File "/home/raphael/.local/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 287, in load_model_ensemble_and_task
    model.load_state_dict(state["model"], strict=strict, args=args)
  File "/home/raphael/.local/lib/python3.8/site-packages/fairseq/models/fairseq_model.py", line 99, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/home/raphael/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BARTModel:
        size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([5544, 1024]) from checkpoint, the shape in current model is torch.Size([5597, 1024]).
        size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([5544, 1024]) from checkpoint, the shape in current model is torch.Size([5597, 1024]).
        size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([5544, 1024]) from checkpoint, the shape in current model is torch.Size([5597, 1024]).

An adaptation I made to the model was to use mbart50 instead of mbart25. Do you know of any way to resolve this error?

Thanks,
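For reference, the mismatch (5544 rows in the checkpoint vs. 5597 expected by the freshly built model) means the dictionaries loaded at generation time imply a different vocabulary size than the one the checkpoint was trained with, for example because a different dict.*.txt or SentencePiece model was copied over, or because the --langs set differs between mbart25 and mbart50. A small diagnostic sketch for comparing the two numbers; the paths are illustrative, and the exact number of extra symbols fairseq adds (special tokens, language tokens, <mask>) depends on the task configuration:

import torch

ckpt = torch.load("resources/models/muss_pt_mined/model.pt", map_location="cpu")  # illustrative path
embed_rows = ckpt["model"]["encoder.embed_tokens.weight"].shape[0]
print("vocabulary size stored in the checkpoint:", embed_rows)

# A fairseq dictionary file lists one entry per line; fairseq then adds its
# special symbols and, for mBART tasks, the language tokens and <mask>.
with open("resources/models/muss_pt_mined/dict.complex.txt", encoding="utf-8") as f:
    dict_entries = sum(1 for _ in f)
print("entries in dict.complex.txt:", dict_entries)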

Simplify Failed

After installing all the dependencies (pip install -e .), I launched the Python script to simplify the French example text, and the command failed (python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined).

python scripts/simplify.py scripts/examples.fr --model-name muss_fr_mined fairseq-generate /home/unapei-muss/admin/tmp/tmp3iwo639d --dataset-impl raw --gen-subset tmp --path /home/unapei-muss/www/muss/resources/models/muss_fr_mined/model.pt --beam 5 --nbest 1 --lenpen 1.0 --diverse-beam-groups -1 --diverse-beam-strength 0.5 --max-tokens 8000 --model-overrides "{'encoder_embed_path': None, 'decoder_embed_path': None}" --skip-invalid-size-inputs-valid-test --task 'translation_from_pretrained_bart' --langs 'ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN' INFO:fairseq_cli.generate:Namespace(all_gather_list_size=16384, batch_size=None, batch_size_valid=None, beam=5, bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', constraints=None, cpu=False, criterion='cross_entropy', curriculum=0, data='/home/unapei-muss/admin/tmp/tmp3iwo639d', data_buffer_size=10, dataset_impl='raw', ddp_backend='c10d', decoding_format=None, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', diverse_beam_groups=-1, diverse_beam_strength=0.5, diversity_rate=-1.0, empty_cache_freq=0, eval_bleu=False, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=False, eval_bleu_remove_bpe=None, eval_tokenized_bleu=False, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='tmp', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, iter_decode_with_beam=1, iter_decode_with_external_reranker=False, langs='ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN', left_pad_source='True', left_pad_target='False', lenpen=1.0, lm_path=None, lm_weight=0.0, load_alignments=False, local_rank=0, localsgd_frequency=3, log_format=None, log_interval=100, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_source_positions=1024, max_target_positions=1024, max_tokens=8000, max_tokens_valid=8000, memory_efficient_bf16=False, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides="{'encoder_embed_path': None, 'decoder_embed_path': None}", model_parallel_size=1, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, no_seed_provided=False, nprocs_per_node=1, num_batch_buckets=0, num_shards=1, num_workers=1, optimizer=None, path='/home/unapei-muss/www/muss/resources/models/muss_fr_mined/model.pt', pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, prefix_size=0, prepend_bos=False, print_alignment=False, print_step=False, profile=False, quantization_config_path=None, quiet=False, remove_bpe=None, replace_unk=None, required_batch_size_multiple=8, required_seq_len_multiple=1, results_path=None, retain_dropout=False, retain_dropout_modules=None, 
retain_iter_history=False, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, scoring='bleu', seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=True, slowmo_algorithm='LocalSGD', slowmo_momentum=None, source_lang=None, target_lang=None, task='translation_from_pretrained_bart', temperature=1.0, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, warmup_updates=0, zero_sharding='none') INFO:fairseq.tasks.translation:[complex] dictionary: 250001 types INFO:fairseq.tasks.translation:[simple] dictionary: 250001 types INFO:fairseq.data.data_utils:loaded 4 examples from: /home/unapei-muss/admin/tmp/tmp3iwo639d/tmp.complex-simple.complex INFO:fairseq.data.data_utils:loaded 4 examples from: /home/unapei-muss/admin/tmp/tmp3iwo639d/tmp.complex-simple.simple INFO:fairseq.tasks.translation:/home/unapei-muss/admin/tmp/tmp3iwo639d tmp complex-simple 4 examples INFO:fairseq_cli.generate:loading model(s) from /home/unapei-muss/www/muss/resources/models/muss_fr_mined/model.pt Killed
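A bare "Killed" with no Python traceback usually means the kernel's out-of-memory killer terminated the process, most often while the mBART checkpoint and its 250k-entry dictionaries were being loaded. A quick hedged check of available RAM against the size of the checkpoint alone (psutil is assumed to be installed; the estimate covers fp32 weights only, not activations or dictionaries):

import psutil
import torch

print(f"available RAM: {psutil.virtual_memory().available / 1e9:.1f} GB")

ckpt = torch.load("resources/models/muss_fr_mined/model.pt", map_location="cpu")
n_params = sum(t.numel() for t in ckpt["model"].values())
print(f"checkpoint parameters: {n_params / 1e6:.0f}M "
      f"(roughly {n_params * 4 / 1e9:.1f} GB in fp32 just for the weights)")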

Mining sequences: reading in the CC dataset

Hi,
Thank you for the great repo. I'm trying to mine sequences with scripts/mine_sequences.py, but the data does not get read at all.

Since I was unable to download the data from https://github.com/facebookresearch/cc_net as requested, I downloaded parts of the French Common Crawl from the OSCAR corpus. When the prompt asks for the path to the data, I point to a directory that contains a couple of *.txt.gz files along with the *.jsonl.gz metadata files.

After inspecting the error raised (ValueError: need at least one array to concatenate), I saw that none of the data was read or written.

What should the directory given at the input prompt include? Knowing that would help me answer these questions:

  • In line 83, the subdirectories of raw_original_dir are checked to find and list *.json.gz files, but there are none there - should they be created in lines 59-63?
  • Similarly, line 106 is looking for *.txt.gz files, but there are none in the dataset_dir directory
  • I also don't understand the condition in line 68: the directories already exist, so none of the jobs will be executed.

Thanks!
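In case it helps while waiting for an answer: cc_net shards are gzipped JSON-lines files in which each document carries, among other fields, a raw_content field with the text. A heavily hedged sketch of wrapping plain-text OSCAR shards into that shape; whether a raw_content field alone satisfies what mine_sequences.py actually reads is an assumption you would need to verify against the muss mining code:

import gzip
import json
from pathlib import Path

def txt_gz_to_json_gz(txt_path: Path, out_path: Path) -> None:
    # One JSON document per input line, mimicking cc_net's json.gz layout.
    with gzip.open(txt_path, "rt", encoding="utf-8") as fin, \
         gzip.open(out_path, "wt", encoding="utf-8") as fout:
        for i, line in enumerate(fin):
            doc = {"raw_content": line.strip(), "language": "fr", "digest": f"oscar-{i}"}
            fout.write(json.dumps(doc, ensure_ascii=False) + "\n")

for txt_file in Path("oscar_fr").glob("*.txt.gz"):  # illustrative input directory
    txt_gz_to_json_gz(txt_file, txt_file.with_name(txt_file.name.replace(".txt.gz", ".json.gz")))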

git clone Host key verification failed

Hi,
I am trying to install MUSS, but I keep getting this error. I ran git clone git@github.com:facebookresearch/muss.git and the output is:

Cloning into 'muss'...
Host key verification failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

When I use the code I downloaded last month and install the packages, this error occurs

ERROR: Could not find a version that satisfies the requirement easse (unavailable)
ERROR: No matching distribution found for easse (unavailable)

Thank you in advance!!

pip install returns 'subprocess-exited-with-error' on install of EASSE

Under Python 3.8 and Python 3.7, 'pip install' fails when trying to install EASSE with this message:

$ pip install -e .
Obtaining file:///home/jkurlandski/workspace/randd/ops5g/sentence_simplification/muss
  Preparing metadata (setup.py) ... done
Collecting easse@ git+git://github.com/feralvam/easse.git
  Cloning git://github.com/feralvam/easse.git to /tmp/pip-install-qqwnuhdz/easse_1cf8179ffd214fd8ad9363bc9cc6fe05
  Running command git clone --filter=blob:none --quiet git://github.com/feralvam/easse.git /tmp/pip-install-qqwnuhdz/easse_1cf8179ffd214fd8ad9363bc9cc6fe05
  fatal: unable to connect to github.com:
  github.com[0: 140.82.114.3]: errno=Connection timed out

  error: subprocess-exited-with-error

  × git clone --filter=blob:none --quiet git://github.com/feralvam/easse.git /tmp/pip-install-qqwnuhdz/easse_1cf8179ffd214fd8ad9363bc9cc6fe05 did not run successfully.
  │ exit code: 128
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet git://github.com/feralvam/easse.git /tmp/pip-install-qqwnuhdz/easse_1cf8179ffd214fd8ad9363bc9cc6fe05 did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
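The failing clone uses the unauthenticated git:// protocol, which GitHub stopped serving in 2022 (and which many firewalls block), so the timeout happens regardless of pip or the Python version. A hedged workaround, assuming the dependency is declared with a git+git:// URL in setup.py as the pip log suggests, is to switch the scheme to https; the fragment below is illustrative, not the repository's exact setup.py:

# Illustrative setup.py fragment: clone easse over https instead of git://.
install_requires = [
    "easse @ git+https://github.com/feralvam/easse.git",
    # ... remaining dependencies unchanged
]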

mining paraphrases fails with time-out error

Hi,

I'm currently trying to generate paraphrase corpora from cc_net using the mine_sequences.py script. As a test run, I was hoping to mine sentence pairs from just a couple of cc_net corpus files (e.g. 0000/en_head.json.gz, 0001/en_head.json.gz). However, the job fails during mining.

I haven't been able to find any answers in the issues so far and would appreciate any guidance on solving this! I've attached the script's output and relevant log/error files.

Disclaimer: I was trying to run this on a single NVIDIA GeForce GTX TITAN X (12GB). Not sure if that would make a difference. Are there any minimum hardware/system requirements?

Thanks in advance!

mine_sequences.out.txt
31066_0_log.out.txt
31066_0_log.err.txt

Train/adapt to other languages

Hi!

I see that it is possible to use MUSS with other languages:

If you are going to add a new language to this project, in folder resources/models/language_models/wikipedia donwload the files of the target language from https://huggingface.co/edugp/kenlm/tree/main/wikipedia. These language models are used to filter high quality sentences in the paraphrase mining phase.

But what if the target language is not listed in the kenlm repository? I would like to try this system on Italian.
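If the target language is missing from that repository, one option is to train an equivalent KenLM model on your own Italian Wikipedia dump. A rough sketch, assuming the KenLM binaries (lmplz, build_binary) are installed and that wiki_it.txt is a plain-text, one-sentence-per-line dump you prepared; matching whatever tokenization the muss mining code applies before scoring is something you would need to verify:

import subprocess
import kenlm

# Train a 5-gram language model and compile it to a binary for fast loading.
subprocess.run("lmplz -o 5 < wiki_it.txt > wiki_it.arpa", shell=True, check=True)
subprocess.run("build_binary wiki_it.arpa wiki_it.binary", shell=True, check=True)

# Score a sentence with the Python bindings, as a sanity check.
model = kenlm.Model("wiki_it.binary")
print(model.score("Questa è una frase di esempio .", bos=True, eos=True))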
