๐ฅ๐ฅ๐ฅ Official release of pretrained models and scripts for โ 'Golden Gemini Is All You Need: Finding the Sweet Spots for Speaker Verification' accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2024.
(Free access) IEEE link: https://ieeexplore.ieee.org/document/10497864
arXiv Link: https://arxiv.org/abs/2312.03620
-
[Important] This repository is dedicated to sharing the pretrained models in our paper for connvinent usage. For training and inference, we recommend using the Gemini DF-ResNet, now available on WeSpeaker: https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb/v2. We extend our gratitude to the WeSpeaker community for their support, with special thanks to Dr. Wang Shuai.
-
Special thanks to Dr. Liu Bei for sharing the implementation details related to DF-ResNet (https://ieeexplore.ieee.org/document/10119228).
-
[New] We also release a large-margin finetuned pretrained model.
Model | Param | Large Margin Fine-Tuning | Vox1-O EER | Vox1-O MinDCF | Vox1-E EER | Vox1-E MinDCF | Vox1-H EER | Vox1-H MinDCF | Pretained Model Folder |
---|---|---|---|---|---|---|---|---|---|
Gemini DF-ResNet60 [Google Drive] | 4.05 | X | 0.941 | 0.089 | 1.051 | 0.116 | 1.799 | 0.166 | 0611-Gemini_df_resnet56-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165 |
Gemini DF-ResNet114 [Google Drive] | 6.53 | X | 0.686 | 0.067 | 0.863 | 0.097 | 1.490 | 0.144 | 0615-Gemini_df_resnet110-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165 |
Gemini DF-ResNet183 [Google Drive] | 9.20 | X | 0.596 | 0.065 | 0.806 | 0.090 | 1.440 | 0.137 | 0621-Gemini_df_resnet179-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165 |
[New] Gemini DF-ResNet183 [Google Drive] | 9.20 | โ | 0.569 | 0.045 | 0.768 | 0.078 | 1.342 | 0.126 | 0621-Gemini_df_resnet179-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165-LM |
*: The layer count between the names of the pretrained model folder and the model itself differs by 4, as mentioned in footnote 4 of the paper, where the distinction lies in whether to include the 4 separate downsampling layers in the layer count. The models are identical, only differing in nomenclature. During experimentation, we did not include the separate downsampling layers in the layer count; however, through discussion during paper writing, we decided to include. Therefore, Gemini DF-ResNet60/114/183 are the official name.
Take 0611-Gemini_df_resnet56-emb256-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-AdamW-epoch165 as an example:
โโโ models/
โ โโโ model_165.pt # The model checkpoint of 165 epoch. This is the model for testing.
โโโ scores/
โ โโโ vox1_asnorm300_result # testing results with asnorm
โ โโโ vox1_snorm300_result # testing results with snorm
โ โโโ vox1_cos_result # testing results simply by cosine similarity
โ โโโ vox2_dev_asnorm300_vox1_O_cleaned.kaldi.det.png # visualization
โ โโโ vox2_dev_asnorm300_vox1_H_cleaned.kaldi.det.png
โ โโโ vox2_dev_asnorm300_vox1_E_cleaned.kaldi.det.png
โ โโโ vox2_dev_asnorm300_vox1_O_cleaned.kaldi.score # scores of all the trials
โ โโโ vox2_dev_asnorm300_vox1_H_cleaned.kaldi.score
โ โโโ vox2_dev_asnorm300_vox1_E_cleaned.kaldi.score
โโโ config.yaml # The config file to train the model in wespeaker platform (https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb/v2/conf)
โโโ train.log # The training log automatically generated by the Wespeaker toolkit.
[Important] ๐ฅ๐ฅ๐ฅ The Gemini DF-ResNet is now available in Wespeaker! (https://github.com/wenet-e2e/wespeaker/tree/master/examples/voxceleb/v2). We encourage using the version provided by Wespeaker for better compatibility. Additionally, you can find the large-margin finetuned pretrained models in both Pytorch and ONNX formats at: https://github.com/wenet-e2e/wespeaker/blob/master/docs/pretrained.md
Alternatively, you can follow the steps below to reproduce the provided checkpoint:
- Set up Wespeaker Toolkit (https://github.com/wenet-e2e/wespeaker).
- Copy the model file, Gemini_df_resnet.py, from this repository to wespeaker/wespeaker/models/
- Modify wespeaker/wespeaker/models/speaker_model.py by adding
and
import wespeaker.models.Gemini_df_resnet as Gemini_df_resnet
elif model_name.startswith("Gemini_df_resnet"): return getattr(Gemini_df_resnet, model_name)
- Create a config file following the config.yaml file in the pre-trained model folder and place it in /wespeaker/examples/voxceleb/v2/conf/
[Note]: Warm-up is not explicitly stated in the configuration file, yet it is employed by default in the Wespeaker toolkit as follows:
warm_from_zero: False warm_up_epoch: 6
- In wespeaker/examples/voxceleb/v2/run.sh, modify 'config' to point to the new config file. And then you can start training and reproduce.
๐ฅโGolden Gemini (this work):
@ARTICLE{10497864,
author={Liu, Tianchi and Lee, Kong Aik and Wang, Qiongqiong and Li, Haizhou},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification},
year={2024},
volume={32},
number={},
pages={2324-2337}
}
(Prior Work) RecXi (Golden Gemini is a continued research of tResNet)
@inproceedings{NEURIPS2023_9d276b0a,
author = {Liu, Tianchi and Lee, Kong Aik and Wang, Qiongqiong and Li, Haizhou},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
pages = {50221--50236},
title = {Disentangling Voice and Content with Self-Supervision for Speaker Recognition},
volume = {36},
year = {2023}
}
(Related Work) DF-ResNet
@ARTICLE{10119228,
author={Liu, Bei and Chen, Zhengyang and Qian, Yanmin},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Depth-First Neural Architecture With Attentive Feature Fusion for Efficient Speaker Verification},
year={2023},
volume={31},
pages={1825-1838}
}
(Related Work) wespeaker toolkit
@INPROCEEDINGS{10096626,
author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Wespeaker: A Research and Production Oriented Speaker Embedding Learning Toolkit},
year={2023},
pages={1-5}
}
@article{wang4748855advancing,
title={Advancing Speaker Embedding Learning: Wespeaker Toolkit for Research and Production},
author={Wang, Shuai and Chen, Zhengyang and Han, Bing and Wang, Hongji and Liang, Chengdong and Zhang, Binbin and Xiang, Xu and Ding, Wen and Rohdin, Johan and Silnova, Anna and others},
journal={Available at SSRN 4748855}
}