Comments (18)
With the latest commit on master, you can change the following options to get almost the same output as fairseq:
- with API:
normalize_scores=True, allow_early_exit=False
- with CLI:
--normalize_scores --disable_early_exit
(Note that disabling early exit slightly reduces the translation speed.)
The few remaining differences are mostly due to small variations in the beam search implementation.
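A minimal sketch of passing these options through the Python API (the model directory and input tokens are placeholders; `normalize_scores` and `allow_early_exit` are the option names given above, as they existed in CTranslate2 at the time):

```python
# Options named in this thread that bring the beam search closer to fairseq.
fairseq_like_options = {
    "beam_size": 5,
    "normalize_scores": True,   # rank hypotheses by length-normalized score
    "allow_early_exit": False,  # decode until beam_size hypotheses finish
}

# Placeholder usage (requires a converted model directory):
# import ctranslate2
# translator = ctranslate2.Translator("ende_ctranslate2/")
# results = translator.translate_batch([["▁Hello", "▁world"]], **fairseq_like_options)
```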
from ctranslate2.
> If I export the weights to the bin format, how is the layer ordering maintained? The layers are still present in the CTranslate2 core.
Each weight has a unique name. That's how the order is preserved between the bin format and the core implementation.
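An illustration of the idea (the variable names below are made up, not CTranslate2's actual naming scheme): each tensor is stored under a unique name, so the core implementation looks weights up by name and the on-disk order never matters.

```python
import numpy as np

# Name-keyed weight storage: serialization order is irrelevant because
# every tensor is retrieved by its unique name.
weights = {
    "encoder/layer_0/self_attention/linear_0/weight": np.zeros((512, 512), dtype=np.float32),
    "encoder/layer_0/ffn/linear_0/weight": np.zeros((2048, 512), dtype=np.float32),
}

def load_weight(name):
    # The core asks for each tensor by name when building the model.
    return weights[name]
```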
> Is dropout needed during inference?
Dropout is not used during inference.
> How to start writing a converter when there are many changes such as these?
Here you are comparing Pre-Norm Transformers (OpenNMT default) and Post-Norm Transformers (fairseq default).
The converter and core implementation supports both variants since this commit: 316ed26
Additionally, the V2 branch slightly improves the converter design to make it easier to implement new converters. We hope to release this version in the coming days or weeks.
Assuming fairseq trains compatible Transformer models (same computation graph), you can write a converter to extract the weights.
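A rough sketch of what such a converter does (the key names and the renaming rule are illustrative, not the actual fairseq/CTranslate2 schema):

```python
# Toy converter skeleton. A real fairseq checkpoint is a dict containing a
# "model" state dict, typically loaded with torch.load(); a plain dict
# stands in for it here.
def convert(checkpoint):
    state_dict = checkpoint["model"]
    converted = {}
    for name, tensor in state_dict.items():
        # Map source parameter names onto the target naming scheme
        # (the replacement rule is illustrative only).
        converted[name.replace("encoder.layers.", "encoder/layer_")] = tensor
    return converted

fake_checkpoint = {"model": {"encoder.layers.0.self_attn_layer_norm.weight": [1.0]}}
converted = convert(fake_checkpoint)
```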
> Also, are the model sizes those of transformer_big?
Which model are you referring to? The benchmark numbers come from a Transformer Base.
> Please consider distilling the model into a smaller model; that would help inference speed and model size.
Users can simply decide to apply distillation before using this project.
I found that in fairseq the first token forwarded into the decoder is `</s>` instead of `<s>`. This requires a small change in the core implementation.
I implemented a converter using the Fairseq API to support multiple checkpoint versions (see the PR above). I translated from their WMT19 model and it seems to work but more testing would be required.
@gvskalyan Can you help testing? You can grab the latest wheels from the CI: download the artifact `python-wheels`, unzip it, and `pip install` the appropriate wheel. Thanks!
Thanks for testing!
I checked further and found the output is the same with greedy search (beam_size=1) but not with beam search. There are indeed small differences in the beam search implementation, for example:
- fairseq sorts hypotheses by the token-level score, not the sentence-level score
- fairseq decodes until `beam_size` hypotheses are finished, while CTranslate2 has a condition to exit earlier
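A toy illustration of the first difference: ranking the same finished hypotheses by raw cumulative score versus a length-normalized score can change the winner (the log-probabilities below are made up):

```python
# Two finished hypotheses with made-up cumulative log-probabilities.
hyps = [
    {"tokens": ["a", "b", "c"], "cum_logprob": -3.0},
    {"tokens": ["a", "b"], "cum_logprob": -2.4},
]

# Sentence-level: rank by the raw cumulative score.
by_sentence = sorted(hyps, key=lambda h: h["cum_logprob"], reverse=True)

# Length-normalized (what an option like normalize_scores enables):
# divide the cumulative score by the hypothesis length.
by_normalized = sorted(
    hyps, key=lambda h: h["cum_logprob"] / len(h["tokens"]), reverse=True
)

# The shorter hypothesis wins on raw score (-2.4 > -3.0), but the longer
# one wins after normalization (-1.0 > -1.2).
```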
We can easily add new translation options to match the fairseq behavior, but we will probably not change the default behavior.
Since the model conversion is working fine, I will merge the PR and add these options in separate PRs. I will summarize here the options you can enable to close the gap with fairseq.
I am trying to convert a fairseq-based Transformer model and get this error:

```
File "/home/user/.local/lib/python3.6/site-packages/ctranslate2/converters/opennmt_py.py", line 25, in _load
    variables["generator.weight"] = checkpoint["generator"]["0.weight"]
KeyError: 'generator'
```
fairseq models are not supported, see the README:
https://github.com/OpenNMT/CTranslate2#converting-models
I wonder if there are plans to support conversion to CTranslate2 from fairseq Transformer models. It is a widely used training framework and also supports some new pretrained models like BART. @guillaumekln
We will probably get to that. Do you want to help? We should first determine the architecture differences with the current Transformer implementation (if any) and then add a converter for these checkpoints.
@guillaumekln I took a quick look at the fairseq weight names and the ones in OpenNMT-py. I noticed that OpenNMT-py models have "encoder.layer_norm.a_2" and "encoder.layer_norm.b_2", which I don't think have a corresponding mapping in fairseq models. In fairseq, there are only (using the encoder as an example):
'encoder.layers.{layer_number}.self_attn_layer_norm.weight',
'encoder.layers.{layer_number}.self_attn_layer_norm.bias',
'encoder.layers.{layer_number}.final_layer_norm.weight',
'encoder.layers.{layer_number}.final_layer_norm.bias'
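For reference, a quick way to list those names is to load the checkpoint state dict and filter it. A toy dict stands in for the real checkpoint below; with fairseq you would typically obtain it via `torch.load("checkpoint.pt", map_location="cpu")["model"]`:

```python
# Toy state dict mimicking fairseq parameter names from the comment above.
state_dict = {
    "encoder.layers.0.self_attn_layer_norm.weight": None,
    "encoder.layers.0.self_attn_layer_norm.bias": None,
    "encoder.layers.0.final_layer_norm.weight": None,
    "encoder.layers.0.final_layer_norm.bias": None,
    "encoder.layers.0.self_attn.k_proj.weight": None,
}

# Keep only the layer-norm parameters.
layer_norm_names = sorted(k for k in state_dict if "layer_norm" in k)
```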
Hi,
There are architectural/implementation-level changes in fairseq's Transformer.
I have a basic doubt about the converter script: if I export the weights to the bin format, how is the layer ordering maintained? The layers are still present in the CTranslate2 core.
Is dropout needed during inference? Is it present in CTranslate2?
Examples of differences:
a) Having a layer norm for the entire encoder or decoder is optional in fairseq: src -> embeddings(src) -> layer(out) -> layer_norm(out) (optional)
b) Each encoder layer in fairseq: self_attn -> residual (self_attn output + input) -> ffn(x) -> residual (x + ffn inputs) -> layer norm (default implementation)
Each encoder layer in OpenNMT-py: norm -> self_attn -> residual (context + inputs) -> ffn(out)
Each decoder layer in OpenNMT-py: norm1 -> self_attn -> self output + inputs -> norm2 -> context_attn -> context output + self output -> ffn(out)
@guillaumekln How to start writing a converter when there are many changes such as these?
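The two orderings described above can be sketched side by side (a minimal, parameter-free layer norm for illustration only; real layers add learned scale/bias, dropout, and attention):

```python
import numpy as np

def layer_norm(x):
    # Minimal layer norm without learned parameters, for illustration only.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def post_norm_block(x, sublayer):
    # fairseq default (post-norm): sublayer -> residual -> norm
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # OpenNMT-py default (pre-norm): norm -> sublayer -> residual
    return x + sublayer(layer_norm(x))

x = np.random.rand(2, 8).astype(np.float32)
ffn = lambda t: t * 2.0  # stand-in for a real feed-forward sublayer
out_post = post_norm_block(x, ffn)
out_pre = pre_norm_block(x, ffn)
```

Because the two variants compose the same sublayers in a different order, their outputs differ, which is why the converter has to know which variant a checkpoint was trained with.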
Let's reopen this issue for visibility. I'm currently trying to add a converter and see if we are missing anything else.
Hi @guillaumekln, I have used this converter with the pretrained model available from the page, but generation with the extracted weights does not produce correct translations. I have verified the variable names and counts and everything seems correct. Am I missing something?
I am using the recently released v2.0.0 for this.
@guillaumekln I will be able to help test a case for myself as well.
Hi @guillaumekln, translation with the latest wheel from the CI seems correct. I tested it on the fairseq WMT19 models and observed the following. Are there any other validation tests to be done?
Is the drop/variation in BLEU score expected, and can it be attributed to the architecture of these models (transformer-big)?
The test sets are from WMT14, scored with sacreBLEU.
Direction | Checkpoint Name | BLEU - Fairseq | BLEU - CTranslate2 | BLEU - CTranslate2 (edit1) | Checkpoint Md5sum - Fairseq | Checkpoint Md5sum - CTranslate2 |
---|---|---|---|---|---|---|
en-de | wmt19.en-de.joined-dict.ensemble/model1 | 35.73 | 35.56 | 35.36 | f19aa2206748c59a0c5719386bc9a967 | 6eed02afd5be4eb949ef43db389a6806 |
de-en | wmt19.de-en.joined-dict.ensemble/model1 | 38.18 | 37.75 | 37.91 | 2f4e81d6cfd3385d37883a85010f88f4 | 3dcf2ab1f1685e2f2794664e3fdf69e6 |
en-ru | wmt19.en-ru.ensemble/model1 | 43.14 | 42.46 | 42.51 | 11f261a927258abf35cc466cf3ca508e | 24f3aac0dca6e65606d90d5377c34d76 |
ru-en | wmt19.ru-en.ensemble/model1 | 41.74 | 41.06 | 41.35 | 59a52df2f09c8ddb01d4c9f5692358ee | afd846d3956bb2a43a6cd1f221f8fc3c |
Update (edit 1): scores with `--normalize_scores --disable_early_exit`
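For anyone reproducing the table, the scores can be computed with the sacreBLEU Python API. A sketch with placeholder file names (the calls are shown commented out because they need real detokenized output files and `pip install sacrebleu`):

```python
# Scoring a detokenized en-de system output against a WMT14 reference
# with sacreBLEU (file names below are placeholders):
#
#   import sacrebleu
#   hyps = open("output.en-de.detok.txt").read().splitlines()
#   refs = open("newstest2014.de.ref").read().splitlines()
#   print(sacrebleu.corpus_bleu(hyps, [refs]).score)
```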
Hi @gvskalyan, I just saw your updated table in #11 (comment). We don't expect this difference. Did you use the same beam size in this comparison?
> The few remaining differences are mostly due to small variations in the beam search implementation.
For reference, #911 should fix the remaining differences.