
StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.

Home Page: https://ictnlp.github.io/StreamSpeech-site/

License: MIT License

Topics: seamless, simultaneous-translation, speech, speech-recognition, speech-synthesis, speech-to-text, speech-translation, translation, all-in-one, machine-translation, streaming-audio, text-to-speech, asr, tts, voice, text-to-audio, non-autoregressive, speech-enhancement, audio-processing, speech-processing

streamspeech's Introduction

StreamSpeech


Authors: Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng*

Code for ACL 2024 paper "StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning".


🎧 Listen to StreamSpeech's translated speech 🎧

💡Highlights:

  1. StreamSpeech achieves SOTA performance on both offline and simultaneous speech-to-speech translation.
  2. StreamSpeech performs streaming ASR, simultaneous speech-to-text translation and simultaneous speech-to-speech translation via an "All in One" seamless model.
  3. StreamSpeech can present intermediate results (i.e., ASR or translation results) during simultaneous translation, offering a more comprehensive low-latency communication experience.

🔥News

⭐Features

Supports 8 Tasks

  • Offline: Speech Recognition (ASR)✅, Speech-to-Text Translation (S2TT)✅, Speech-to-Speech Translation (S2ST)✅, Speech Synthesis (TTS)✅
  • Simultaneous: Streaming ASR✅, Simultaneous S2TT✅, Simultaneous S2ST✅, Real-time TTS✅, all at any latency (with a single model)

GUI Demo

demo.mov

Simultaneously provides ASR, translation, and synthesis results via a single seamless model

Case

Speech Input: example/wavs/common_voice_fr_17301936.mp3

Transcription (ground truth): jai donc lexpérience des années passées jen dirai un mot tout à lheure

Translation (ground truth): i therefore have the experience of the passed years i'll say a few words about that later

Speech Recognition
  • Simultaneous: jai donc expérience des années passé jen dirairai un mot tout à lheure
  • Offline: jai donc lexpérience des années passé jen dirairai un mot tout à lheure

Speech-to-Text Translation
  • Simultaneous: i therefore have an experience of last years i will tell a word later
  • Offline: so i have the experience in the past years i'll say a word later

Speech-to-Speech Translation
  • Simultaneous: simul-s2st.mov
  • Offline: offline-s2st.mov

Text-to-Speech Synthesis (incrementally synthesize speech word by word)
  • Simultaneous: simul-tts.mov
  • Offline: offline-tts.mov

⚙Requirements

  • Python == 3.10, PyTorch == 2.0.1, Install fairseq & SimulEval

    cd fairseq
    pip install --editable ./ --no-build-isolation
    cd ../SimulEval
    pip install --editable ./
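
After installation, a quick import check can confirm the environment is usable (this snippet is illustrative, not part of the official setup):

    # run inside the activated environment
    import fairseq
    import simuleval
    import torch

    print(torch.__version__)  # should print 2.0.1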

🚀Quick Start

1. Model Download

(1) StreamSpeech Models

Language | UnitY | StreamSpeech (offline) | StreamSpeech (simultaneous)
Fr-En | unity.fr-en.pt [Huggingface] [Baidu] | streamspeech.offline.fr-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.fr-en.pt [Huggingface] [Baidu]
Es-En | unity.es-en.pt [Huggingface] [Baidu] | streamspeech.offline.es-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.es-en.pt [Huggingface] [Baidu]
De-En | unity.de-en.pt [Huggingface] [Baidu] | streamspeech.offline.de-en.pt [Huggingface] [Baidu] | streamspeech.simultaneous.de-en.pt [Huggingface] [Baidu]

(2) Unit-based HiFi-GAN Vocoder

Unit config | Unit size | Vocoder language | Dataset | Model
mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config
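
For reference, the downloaded vocoder can be driven through fairseq's CodeHiFiGANVocoder. A minimal sketch (the unit IDs below are dummy values; the paths correspond to the VOCODER_CKPT/VOCODER_CFG variables used in the inference scripts later in this README):

    import json
    import torch
    from fairseq.models.text_to_speech.vocoder import CodeHiFiGANVocoder

    ckpt = "unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000"
    cfg_path = "unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json"
    with open(cfg_path) as f:
        cfg = json.load(f)
    vocoder = CodeHiFiGANVocoder(ckpt, cfg)

    units = torch.LongTensor([[120, 998, 43, 7]])  # dummy discrete units (unit size 1000)
    wav = vocoder({"code": units}, dur_prediction=True)  # 1-D waveform tensor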

2. Prepare Data and Config (only for test/inference)

(1) Config Files

Replace /data/zhangshaolei/StreamSpeech in the files configs/fr-en/config_gcmvn.yaml and configs/fr-en/config_mtl_asr_st_ctcst.yaml with the local path of your StreamSpeech repo.

(2) Test Data

Prepare test data following SimulEval format. example/ provides an example:

  • wav_list.txt: Each line records the path of a source speech.
  • target.txt: Each line records the reference text, e.g., target translation or source transcription (used to calculate the metrics); see the illustration below.
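
For instance, with the single example clip shipped in example/, the two files could look like this (one entry per line, paired by line number; the reference is the ground-truth translation quoted above):

wav_list.txt:

    example/wavs/common_voice_fr_17301936.mp3

target.txt:

    i therefore have the experience of the passed years i'll say a few words about that later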

3. Inference with SimulEval

Run the following scripts to perform inference with StreamSpeech on streaming ASR, simultaneous S2TT, and simultaneous S2ST.

--source-segment-size: the chunk size in milliseconds; set it to any value to control the latency.
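
For intuition on this parameter: SimulEval feeds the source audio to the agent in increments of --source-segment-size, so smaller chunks mean more (and earlier) decoding steps. A quick illustration (the 5-second duration is made up):

    import math

    audio_ms = 5000  # hypothetical 5-second utterance
    chunk_ms = 320   # --source-segment-size
    print(math.ceil(audio_ms / chunk_ms))  # -> 16 chunks fed to the agent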

Simultaneous Speech-to-Speech Translation

--output-asr-translation: whether to output the intermediate ASR and translated text results during simultaneous speech-to-speech translation.

export CUDA_VISIBLE_DEVICES=0

ROOT=/data/zhangshaolei/StreamSpeech # path to StreamSpeech repo
PRETRAIN_ROOT=/data/zhangshaolei/pretrain_models 
VOCODER_CKPT=$PRETRAIN_ROOT/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000 # path to downloaded Unit-based HiFi-GAN Vocoder
VOCODER_CFG=$PRETRAIN_ROOT/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json # path to downloaded Unit-based HiFi-GAN Vocoder

LANG=fr
file=streamspeech.simultaneous.${LANG}-en.pt # path to downloaded StreamSpeech model
output_dir=$ROOT/res/streamspeech.simultaneous.${LANG}-en/simul-s2st

chunk_size=320 #ms
PYTHONPATH=$ROOT/fairseq simuleval --data-bin ${ROOT}/configs/${LANG}-en \
    --user-dir ${ROOT}/researches/ctc_unity --agent-dir ${ROOT}/agent \
    --source example/wav_list.txt --target example/target.txt \
    --model-path $file \
    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \
    --agent $ROOT/agent/speech_to_speech.streamspeech.agent.py \
    --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG --dur-prediction \
    --output $output_dir/chunk_size=$chunk_size \
    --source-segment-size $chunk_size \
    --quality-metrics ASR_BLEU  --target-speech-lang en --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks DiscontinuitySum DiscontinuityAve DiscontinuityNum RTF \
    --device gpu --computation-aware \
    --output-asr-translation True

You should get the following outputs:

fairseq plugins loaded...
fairseq plugins loaded...
fairseq plugins loaded...
fairseq plugins loaded...
2024-06-06 09:45:46 | INFO     | fairseq.tasks.speech_to_speech | dictionary size: 1,004
import agents...
Removing weight norm...
2024-06-06 09:45:50 | INFO     | agent.tts.vocoder | loaded CodeHiFiGAN checkpoint from /data/zhangshaolei/pretrain_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000
2024-06-06 09:45:50 | INFO     | simuleval.utils.agent | System will run on device: gpu.
2024-06-06 09:45:50 | INFO     | simuleval.dataloader | Evaluating from speech to speech.
  0%|                                                                                                                                                                              | 0/2 [00:00<?, ?it/s]
Streaming ASR: 
Streaming ASR: 
Streaming ASR: je
Simultaneous translation: i would
Streaming ASR: je voudrais
Simultaneous translation: i would like to
Streaming ASR: je voudrais soumettre
Simultaneous translation: i would like to sub
Streaming ASR: je voudrais soumettre cette
Simultaneous translation: i would like to submit
Streaming ASR: je voudrais soumettre cette idée
Simultaneous translation: i would like to submit this
Streaming ASR: je voudrais soumettre cette idée à la
Simultaneous translation: i would like to submit this idea to
Streaming ASR: je voudrais soumettre cette idée à la réflexion
Simultaneous translation: i would like to submit this idea to the
Streaming ASR: je voudrais soumettre cette idée à la réflexion de
Simultaneous translation: i would like to submit this idea to the reflection
Streaming ASR: je voudrais soumettre cette idée à la réflexion de lassemblée
Simultaneous translation: i would like to submit this idea to the reflection of
Streaming ASR: je voudrais soumettre cette idée à la réflexion de lassemblée nationale
Simultaneous translation: i would like to submit this idea to the reflection of the
Streaming ASR: je voudrais soumettre cette idée à la réflexion de lassemblée nationale
Simultaneous translation: i would like to submit this idea to the reflection of the national assembly
 50%|███████████████████████████████████████████████████████████████████████████████████                                                                                   | 1/2 [00:04<00:04,  4.08s/it]
Streaming ASR: 
Streaming ASR: 
Streaming ASR: 
Streaming ASR: 
Streaming ASR: jai donc
Simultaneous translation: i therefore
Streaming ASR: jai donc
Streaming ASR: jai donc expérience des
Simultaneous translation: i therefore have an experience
Streaming ASR: jai donc expérience des années
Streaming ASR: jai donc expérience des années passé
Simultaneous translation: i therefore have an experience of last
Streaming ASR: jai donc expérience des années passé jen
Simultaneous translation: i therefore have an experience of last years
Streaming ASR: jai donc expérience des années passé jen dirairai
Simultaneous translation: i therefore have an experience of last years i will
Streaming ASR: jai donc expérience des années passé jen dirairai un mot
Simultaneous translation: i therefore have an experience of last years i will tell a
Streaming ASR: jai donc expérience des années passé jen dirairai un mot tout à lheure
Simultaneous translation: i therefore have an experience of last years i will tell a word
Streaming ASR: jai donc expérience des années passé jen dirairai un mot tout à lheure
Simultaneous translation: i therefore have an experience of last years i will tell a word later
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.02s/it]
2024-06-06 09:45:56 | WARNING  | simuleval.scorer.asr_bleu | Beta feature: Evaluating speech output. Faieseq is required.
2024-06-06 09:46:12 | INFO | fairseq.tasks.audio_finetuning | Using dict_path : /data/zhangshaolei/.cache/ust_asr/en/dict.ltr.txt
Transcribing predictions: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.63it/s]
2024-06-06 09:46:21 | INFO     | simuleval.sentence_level_evaluator | Results:
 ASR_BLEU       AL    AL_CA    AP  AP_CA      DAL  DAL_CA  StartOffset  StartOffset_CA  EndOffset  EndOffset_CA     LAAL  LAAL_CA      ATD   ATD_CA  NumChunks  NumChunks_CA  DiscontinuitySum  DiscontinuitySum_CA  DiscontinuityAve  DiscontinuityAve_CA  DiscontinuityNum  DiscontinuityNum_CA   RTF  RTF_CA
   15.448 1724.895 2913.508 0.425  0.776 1358.812 3137.55       1280.0        2213.906     1366.0        1366.0 1724.895 2913.508 1440.146 3389.374        9.5           9.5             110.0                110.0              55.0                 55.0                 1                    1 1.326   1.326

Logs and evaluation results are stored in $output_dir/chunk_size=$chunk_size:

$output_dir/chunk_size=$chunk_size
├── wavs/
│   ├── 0_pred.wav # generated speech
│   ├── 1_pred.wav 
│   ├── 0_pred.txt # ASR transcription for the ASR-BLEU toolkit
│   ├── 1_pred.txt 
├── config.yaml
├── asr_transcripts.txt # ASR-BLEU transcription results
├── metrics.tsv
├── scores.tsv
├── asr_cmd.bash
└── instances.log # logs of Simul-S2ST
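
To inspect a run programmatically, something like the following reads the metric table (a sketch assuming scores.tsv stores the columns shown in the evaluator output above):

    import csv

    # run directory as configured in the script above
    run_dir = "res/streamspeech.simultaneous.fr-en/simul-s2st/chunk_size=320"
    with open(f"{run_dir}/scores.tsv") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            print(row["ASR_BLEU"], row["AL"])  # quality / latency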

Simultaneous Speech-to-Text Translation

export CUDA_VISIBLE_DEVICES=0

ROOT=/data/zhangshaolei/StreamSpeech # path to StreamSpeech repo

LANG=fr
file=streamspeech.simultaneous.${LANG}-en.pt # path to downloaded StreamSpeech model
output_dir=$ROOT/res/streamspeech.simultaneous.${LANG}-en/simul-s2tt

chunk_size=320 #ms
PYTHONPATH=$ROOT/fairseq simuleval --data-bin ${ROOT}/configs/${LANG}-en \
    --user-dir ${ROOT}/researches/ctc_unity --agent-dir ${ROOT}/agent \
    --source example/wav_list.txt --target example/target.txt \
    --model-path $file \
    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \
    --agent $ROOT/agent/speech_to_text.s2tt.streamspeech.agent.py \
    --output $output_dir/chunk_size=$chunk_size \
    --source-segment-size $chunk_size \
    --quality-metrics BLEU  --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks RTF \
    --device gpu --computation-aware 

Streaming ASR

export CUDA_VISIBLE_DEVICES=0

ROOT=/data/zhangshaolei/StreamSpeech # path to StreamSpeech repo

LANG=fr
file=streamspeech.simultaneous.${LANG}-en.pt # path to downloaded StreamSpeech model
output_dir=$ROOT/res/streamspeech.simultaneous.${LANG}-en/streaming-asr

chunk_size=320 #ms
PYTHONPATH=$ROOT/fairseq simuleval --data-bin ${ROOT}/configs/${LANG}-en \
    --user-dir ${ROOT}/researches/ctc_unity --agent-dir ${ROOT}/agent \
    --source example/wav_list.txt --target example/source.txt \
    --model-path $file \
    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \
    --agent $ROOT/agent/speech_to_text.asr.streamspeech.agent.py \
    --output $output_dir/chunk_size=$chunk_size \
    --source-segment-size $chunk_size \
    --quality-metrics BLEU  --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks RTF \
    --device gpu --computation-aware 

🎈Develop Your Own StreamSpeech

1. Data Preprocessing

2. Training

Note

You can directly use the downloaded StreamSpeech model for evaluation and skip training.


Model | --user-dir | --arch | Description
Translatotron 2 | researches/translatotron | s2spect2_conformer_modified | Translatotron 2
UnitY | researches/translatotron | unity_conformer_modified | UnitY
Uni-UnitY | researches/uni_unity | uni_unity_conformer | Change all encoders in UnitY into unidirectional
Chunk-UnitY | researches/chunk_unity | chunk_unity_conformer | Change the Conformer in UnitY into a Chunk-based Conformer
StreamSpeech | researches/ctc_unity | streamspeech | StreamSpeech
StreamSpeech (cascade) | researches/ctc_unity | streamspeech_cascade | Cascaded StreamSpeech of S2TT and TTS; the TTS module can be used independently for real-time TTS given incremental text
HMT | researches/hmt | hmt_transformer_iwslt_de_en | HMT: strong simultaneous text-to-text translation method
DiSeg | researches/diseg | convtransformer_espnet_base_seg | DiSeg: strong simultaneous speech-to-text translation method

Tip

The train_scripts/ and test_scripts/ directories under each --user-dir provide the training and testing scripts for each model. Refer to the official repos of UnitY, Translatotron 2, HMT, and DiSeg for more details.

3. Evaluation

(1) Offline Evaluation

Follow pred.offline-s2st.sh to evaluate the offline performance of StreamSpeech on ASR, S2TT and S2ST.

(2) Simultaneous Evaluation

A trained StreamSpeech model can be used for streaming ASR, simultaneous speech-to-text translation and simultaneous speech-to-speech translation. We provide agent/ for these three tasks:

  • agent/speech_to_speech.streamspeech.agent.py: simultaneous speech-to-speech translation
  • agent/speech_to_text.s2tt.streamspeech.agent.py: simultaneous speech-to-text translation
  • agent/speech_to_text.asr.streamspeech.agent.py: streaming ASR

Follow simuleval.simul-s2st.sh, simuleval.simul-s2tt.sh, simuleval.streaming-asr.sh to evaluate StreamSpeech.

4. Our Results

Our project page (https://ictnlp.github.io/StreamSpeech-site/) provides translated speech generated by StreamSpeech; give it a listen 🎧.

(1) Offline Speech-to-Speech Translation ( ASR-BLEU: quality )

[offline S2ST results figure]

(2) Simultaneous Speech-to-Speech Translation ( AL: latency | ASR-BLEU: quality )

[simultaneous S2ST results figure]

(3) Simultaneous Speech-to-Text Translation ( AL: latency | BLEU: quality )

[simultaneous S2TT results figure]

(4) Streaming ASR ( AL: latency | WER: quality )

[streaming ASR results figure]

🖋Citation

If you have any questions, please feel free to submit an issue or contact [email protected].

If our work is useful for you, please cite as:

@inproceedings{streamspeech,
      title={StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning}, 
      author={Shaolei Zhang and Qingkai Fang and Shoutao Guo and Zhengrui Ma and Min Zhang and Yang Feng},
      year={2024},
      booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Long Papers)},
      publisher = {Association for Computational Linguistics}
}

streamspeech's People

Contributors

zhangshaolei1998


streamspeech's Issues

pip install fairseq fails due to invalid metadata for PyYAML dependency in omegaconf


Issue Description

Environment:

  • Ubuntu 22.04
  • Conda environment: streamspeech
  • Python 3.10
  • GCC version: 12.3.0

Steps to Reproduce:

  1. Create and activate conda environment:

    conda create -n streamspeech python=3.10 -y
    conda activate streamspeech
  2. Install dependencies:

    conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.7 -c pytorch
    sudo apt-get update
    sudo apt-get install -y build-essential python3-dev
    pip install cython numpy
    pip install --upgrade pip setuptools
    pip install PyYAML
  3. Check PyYAML version:

    python -c "import yaml; print(yaml.__version__)"
    # Output: 6.0.1
  4. Try to install fairseq:

    pip install --editable ./ --no-build-isolation

Observed Behavior:

  • The installation process fails with the following errors:
    WARNING: Error parsing dependencies of omegaconf: .* suffix can only be used with `==` or `!=` operators
        PyYAML (>=5.1.*)
                ~~~~~~^
    
    ERROR: Exception:
    Traceback (most recent call last):
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3070, in _dep_map
        return self.__dep_map
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2863, in __getattr__
        raise AttributeError(attr)
    AttributeError: _DistInfoDistribution__dep_map
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/packaging/requirements.py", line 36, in __init__
        parsed = _parse_requirement(requirement_string)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/packaging/_parser.py", line 62, in parse_requirement
        return _parse_requirement(Tokenizer(source, rules=DEFAULT_RULES))
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/packaging/_parser.py", line 80, in _parse_requirement
        url, specifier, marker = _parse_requirement_details(tokenizer)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/packaging/_parser.py", line 118, in _parse_requirement_details
        specifier = _parse_specifier(tokenizer)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/packaging/_parser.py", line 214, in _parse_specifier
        parsed_specifiers = _parse_version_many(tokenizer)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/packaging/_parser.py", line 229, in _parse_version_many
        tokenizer.raise_syntax_error(
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/packaging/_tokenizer.py", line 167, in raise_syntax_error
        raise ParserSyntaxError(
    pip._vendor.packaging._tokenizer.ParserSyntaxError: .* suffix can only be used with `==` or `!=` operators
        PyYAML (>=5.1.*)
                ~~~~~~^
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 179, in exc_logging_wrapper
        status = run_func(*args)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 67, in wrapper
        return func(self, options, args)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 377, in run
        requirement_set = resolver.resolve(
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 95, in resolve
        result = self._result = resolver.resolve(
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
        state = resolution.resolve(requirements, max_rounds=max_rounds)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 427, in resolve
        failure_causes = self._attempt_to_pin_criterion(name)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 239, in _attempt_to_pin_criterion
        criteria = self._get_updated_criteria(candidate)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/provider.py", line 247, in get_dependencies
        return [r for r in candidate.iter_dependencies(with_requires) if r is not None]
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/provider.py", line 247, in <listcomp>
        return [r for r in candidate.iter_dependencies(with_requires) if r is not None]
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 401, in iter_dependencies
        for r in self.dist.iter_dependencies():
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_internal/metadata/pkg_resources.py", line 247, in iter_dependencies
        return self._dist.requires(extras)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2786, in requires
        dm = self._dep_map
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3072, in _dep_map
        self.__dep_map = self._compute_dependencies()
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3082, in _compute_dependencies
        reqs.extend(parse_requirements(req))
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/pkg_resources/__init__.py", line 3135, in __init__
        super().__init__(requirement_string)
      File "/root/exit/envs/streamspeech/lib/python3.10/site-packages/pip/_vendor/packaging/requirements.py", line 38, in __init__
        raise InvalidRequirement(str(e)) from e
    pip._vendor.packaging.requirements.InvalidRequirement: .* suffix can only be used with `==` or `!=` operators
        PyYAML (>=5.1.*)
                ~~~~~~^
    

Expected Behavior:
fairseq should be installed without errors.

Additional Information:

  • PyYAML version: 6.0.1
  • omegaconf version: 2.0.5
  • hydra-core version: 1.0.7
  • torch version: 2.3.1
  • pip version: 23.1

Notes:

  • The PyYAML dependency in omegaconf has an invalid metadata format that causes the error.
  • Running as root might cause permission issues and conflicts, but switching to a non-root user does not resolve the dependency issues.

Steps Tried:

  1. Upgrading/downgrading pip, setuptools, and wheel.
  2. Installing specific versions of PyYAML and omegaconf.
  3. Setting up a new conda environment with specific versions of dependencies.
  4. Verifying GCC installation and setting default versions.

Request for Help:
I need assistance in resolving this installation failure. Any guidance or solutions would be appreciated.

Contact Information:
[email protected]

Labels

  • needs triage
  • bug

Low ASR-BLEU result with simul-s2st.sh

Hi, I ran the preprocess script and the bash script on the es-en dataset following the instructions, using the pretrained model weights from Huggingface. However, the resulting ASR-BLEU score was only 9.1. What might be the reason for this inaccuracy?

Make sure Hugging Face download stats work, better discoverability

Dear authors,

Thanks for this nice work! I saw the checkpoints are already pushed to the 🤗 hub which is great: https://huggingface.co/ICTNLP/StreamSpeech_Models/tree/main, however there are a few things that could be improved which will help in making more people discover your models.

To make download stats work for your models, there are a few options.

  • in case your models are regular nn.Module classes, one can leverage the PyTorchModelHubMixin which automatically adds push_to_hub and from_pretrained to your custom PyTorch models, ensuring download stats will work. This also uses safetensors by default rather than pickle to store weights, which is considered safer.
  • alternatively, you can also follow this guide to make download stats work: https://huggingface.co/docs/hub/models-download-stats. This allows you to specify a file extension (like *.pt) to track downloads.
  • lastly, we also offer some utility methods which allow you to load files with a single line of code, e.g.:

Usage is as follows:

from huggingface_hub import hf_hub_download
import torch

filepath = hf_hub_download(repo_id="ICTNLP/StreamSpeech_Models", filename="streamspeech.offline.de-en.pt", repo_type="model")
state_dict = torch.load(filepath, map_location="cpu")

We also offer upload_file, upload_folder for pushing to the hub.

Let me know if you need any help!

Kind regards,

Niels
ML Engineer @ HF 🤗

Unable to run test script on local machine

Hi Mr. Zhang,

I was trying to recreate the results on my local machine, but I encountered the following error:

Traceback (most recent call last):
  File "/Users/ifrit/Library/Python/3.9/bin/simuleval", line 33, in <module>
    sys.exit(load_entry_point('simuleval', 'console_scripts', 'simuleval')())
  File "<my streamspeech directory>/SimulEval/simuleval/cli.py", line 47, in main
    system, args = build_system_args()
  File "<my streamspeech directory>/SimulEval/simuleval/utils/agent.py", line 131, in build_system_args
    system = system_class.from_args(args)
  File "<my streamspeech directory>/SimulEval/simuleval/agents/agent.py", line 161, in from_args
    return cls(args)
  File "<my streamspeech directory>/agent/speech_to_speech.streamspeech.agent.py", line 117, in __init__
    self.load_model_vocab(args)
  File "<my streamspeech directory>/agent/speech_to_speech.streamspeech.agent.py", line 356, in load_model_vocab
    utils.import_user_module(state["cfg"].common)
  File "<my streamspeech directory>/fairseq/fairseq/utils.py", line 481, in import_user_module
    raise FileNotFoundError(module_path)
FileNotFoundError: /data/zhangshaolei/SimulS2S/reasearchs/ctc_unity

I changed all the config files as necessary, but this issue remains. I looked into the code a bit, and I believe it's due to the state of the loaded checkpoint file:

    >>> print(state['cfg'].common.user_dir)
    /data/zhangshaolei/SimulS2S/reasearchs/ctc_unity

My solution is to add a line changing the user_dir before the import_user_module call in line 356 in speech_to_speech.streamspeech.agent.py, and it's working fine for me now.

    state = checkpoint_utils.load_checkpoint_to_cpu(filename)
    state["cfg"].common.user_dir = 'reasearchs/ctc_unity/'
    utils.import_user_module(state["cfg"].common)

I'm not sure if it's due to me missing any config changes, but I think it's unlikely as it seems written into the pretrained weights. I've only tested Simultaneous S2ST on fr-en, but I guess similar errors will happen in other scenarios.

Thank you so much!

some folders and files are missing

Traceback (most recent call last):
  File "/home/StreamSpeech/demo/app.py", line 26, in <module>
    from examples.speech_to_text.data_utils import extract_fbank_features
ModuleNotFoundError: No module named 'examples.speech_to_text'

I followed the steps one by one, and got this error while running app.py.

Trained model can generate correct text but incorrect speech

I tried to reproduce the training of the fr-en simultaneous model. I followed the instructions to prepare the dataset and ran the script train.simul-s2st.sh.
The model training seems to go fine, but weird behaviors happen during evaluation of our trained model (using ./simuleval.simul-s2st.sh).
Here is the training log:
[training log screenshot]
During inference, when I ran the eval scripts on the example you provided, the weird thing happens: the model outputs the correct text translation, but the output speech is incorrect (almost silent). I print the text output and speech-unit output as follows:
[text and speech-unit output screenshot]

Do you know what the problem may be?

Thank you

Error when loading speech_to_speech_ctc task

Description

When running the simuleval command with the speech_to_speech.streamspeech agent, I encountered the following error:

Traceback (most recent call last):
  File "/Users/arararz/anaconda3/envs/streamspeech/bin/simuleval", line 33, in <module>
    sys.exit(load_entry_point('simuleval', 'console_scripts', 'simuleval')())
  File "/Users/arararz/Documents/GitHub/StreamSpeech/SimulEval/simuleval/cli.py", line 47, in main
    system, args = build_system_args()
  File "/Users/arararz/Documents/GitHub/StreamSpeech/SimulEval/simuleval/utils/agent.py", line 131, in build_system_args
    system = system_class.from_args(args)
  File "/Users/arararz/Documents/GitHub/StreamSpeech/SimulEval/simuleval/agents/agent.py", line 161, in from_args
    return cls(args)
  File "/Users/arararz/Documents/GitHub/StreamSpeech/agent/speech_to_speech.streamspeech.agent.py", line 117, in __init__
    self.load_model_vocab(args)
  File "/Users/arararz/Documents/GitHub/StreamSpeech/agent/speech_to_speech.streamspeech.agent.py", line 382, in load_model_vocab
    task = tasks.setup_task(task_args)
  File "/Users/arararz/Documents/GitHub/StreamSpeech/fairseq/fairseq/tasks/__init__.py", line 31, in setup_task
    task = TASK_REGISTRY[task_name]
KeyError: 'speech_to_speech_ctc'

The error seems to be related to the speech_to_speech_ctc task not being found in the task registry.

Steps to Reproduce

  1. Set up the StreamSpeech environment
  2. Run the simultaneous s2st script provided in the readme

Environment

  • Operating System: macOS (M2 Max)
  • Python Version: 3.10.14

Train on other language

Hello, this is amazing.
I want to ask whether it can be trained on other languages, or even on multiple languages at the same time.

RuntimeError: Input tensor has to be 2D. - When using Web GUI demo with own audio(.mp3)

INFO:werkzeug:127.0.0.1 - - [09/Aug/2024 14:15:00] "POST /upload HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [09/Aug/2024 14:15:00] "GET /uploads/testing2.MP3?latency=320 HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/home/zheng/anaconda3/envs/streamspeech/lib/python3.10/site-packages/flask/app.py", line 1498, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/zheng/anaconda3/envs/streamspeech/lib/python3.10/site-packages/flask/app.py", line 1476, in wsgi_app
    response = self.handle_exception(e)
  File "/home/zheng/anaconda3/envs/streamspeech/lib/python3.10/site-packages/flask/app.py", line 1473, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/zheng/anaconda3/envs/streamspeech/lib/python3.10/site-packages/flask/app.py", line 882, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/zheng/anaconda3/envs/streamspeech/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/zheng/anaconda3/envs/streamspeech/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/home/zheng/fuchengzheng/steamspeech/StreamSpeech/demo/app.py", line 909, in uploaded_file
    run(path)
  File "/home/zheng/fuchengzheng/steamspeech/StreamSpeech/demo/app.py", line 836, in run
    action=agent.policy()
  File "/home/zheng/anaconda3/envs/streamspeech/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/zheng/fuchengzheng/steamspeech/StreamSpeech/demo/app.py", line 468, in policy
    feature = self.feature_extractor(self.states.source)
  File "/home/zheng/fuchengzheng/steamspeech/StreamSpeech/demo/app.py", line 100, in __call__
    waveform, sample_rate = convert_waveform(
  File "/home/zheng/fuchengzheng/steamspeech/StreamSpeech/fairseq/fairseq/data/audio/audio_utils.py", line 60, in convert_waveform
    converted, converted_sample_rate = ta_sox.apply_effects_tensor(
  File "/home/zheng/anaconda3/envs/streamspeech/lib/python3.10/site-packages/torchaudio/sox_effects/sox_effects.py", line 156, in apply_effects_tensor
    return sox_ext.apply_effects_tensor(tensor, sample_rate, effects, channels_first)
  File "/home/zheng/anaconda3/envs/streamspeech/lib/python3.10/site-packages/torch/ops.py", line 1061, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: Input tensor has to be 2D.
