
salmonn's Introduction

SALMONN: Speech Audio Language Music Open Neural Network

🚀🚀 Welcome to the repo of SALMONN!

SALMONN is a large language model (LLM) that accepts speech, audio event, and music inputs, developed by the Department of Electronic Engineering at Tsinghua University and ByteDance. Instead of handling speech-only or audio-event-only input, SALMONN can perceive and understand all kinds of audio inputs and thereby gains emergent capabilities such as multilingual speech recognition and translation and audio-speech co-reasoning. This can be regarded as giving the LLM "ears" and cognitive hearing abilities, which makes SALMONN a step towards hearing-enabled artificial general intelligence.

🔥 News

  • [2023-10-08] ✨ We have released the model checkpoint and the inference code for SALMONN-13B!
  • [2023-11-13] 🎁 We have released a 7B version of SALMONN at tsinghua-ee/SALMONN-7B and built the 7B demo here!
  • [2024-01-16] 💖 Our paper was accepted by ICLR 2024!
  • [2024-04-07] 🤖 We have released all the code you need to train your own SALMONN! Try some cool things!

🌟 Structure

The model architecture of SALMONN is shown below. A window-level Q-Former is used as the connection module to fuse the outputs from a Whisper speech encoder and a BEATs audio encoder into augmented audio tokens, which are aligned with the LLM input space. A LoRA adaptor aligns the augmented LLM input space with its output space. The text prompt is used to instruct SALMONN to answer open-ended questions about the general audio inputs, and the answers are given in the LLM text responses.
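
To make the fusion concrete, here is a minimal PyTorch sketch of a window-level Q-Former. The encoder dimensions, window length, number of queries, and module names are illustrative assumptions, not the values used in this repository:

    # Minimal sketch of window-level Q-Former fusion (illustrative assumptions only).
    import torch
    import torch.nn as nn

    class WindowLevelQFormer(nn.Module):
        def __init__(self, enc_dim=1280 + 768, llm_dim=5120, n_queries=1, window=17):
            super().__init__()
            self.window = window                     # frames per window
            self.queries = nn.Parameter(torch.randn(n_queries, enc_dim))
            self.cross_attn = nn.MultiheadAttention(enc_dim, num_heads=8, batch_first=True)
            self.proj = nn.Linear(enc_dim, llm_dim)  # align with the LLM input space

        def forward(self, speech_feats, audio_feats):
            # Concatenate Whisper and BEATs features along the channel dimension.
            x = torch.cat([speech_feats, audio_feats], dim=-1)        # (B, T, enc_dim)
            B, T, D = x.shape
            pad = (-T) % self.window
            x = nn.functional.pad(x, (0, 0, 0, pad))                  # pad T to a multiple of window
            x = x.view(B * (T + pad) // self.window, self.window, D)  # split into windows
            q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
            out, _ = self.cross_attn(q, x, x)                         # one query set per window
            out = out.reshape(B, -1, D)                               # (B, n_windows * n_queries, enc_dim)
            return self.proj(out)                                     # augmented audio tokens for the LLM

    # Example: 400 frames of features from both encoders.
    tokens = WindowLevelQFormer()(torch.randn(1, 400, 1280), torch.randn(1, 400, 768))
    print(tokens.shape)  # torch.Size([1, 24, 5120])

Because each window is summarized by a small, fixed number of query tokens, the number of augmented audio tokens grows with the audio length while the per-window resolution stays constant.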

⚡️ Demos

Compared with traditional speech and audio processing tasks such as speech recognition and audio captioning, SALMONN leverages the general knowledge and cognitive abilities of the LLM to achieve cognitively oriented audio perception, which dramatically improves the versatility of the model and the richness of its tasks. In addition, SALMONN is able to follow textual commands, and even spoken commands, with a relatively high degree of accuracy. Since SALMONN is trained only on textual commands, listening to spoken commands is also a cross-modal emergent ability.

Here are some examples of SALMONN.

Audio          Response
gunshots.wav   sac
duck.wav       story
music.wav      mc

🌈 How to train a model

For SALMONN-13B v1, you need the following setup:

  1. Our environment: the Python version is 3.9.17, and the other required packages can be installed with pip install -r requirements.txt.
  2. Download Whisper Large v2 to whisper_path.
  3. Download the fine-tuned BEATs_iter3+ (AS2M) (cpt2) checkpoint to beats_path.
  4. Download Vicuna 13B v1.1 to llama_path.
  5. Run python3 train.py --cfg-path configs/config.yaml on an A100-SXM-80GB GPU (a path-check sketch follows this list).
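
Before launching training, it can help to verify that configs/config.yaml points at the checkpoints downloaded above. The snippet below is a hedged sketch: it assumes PyYAML is available, the key names mirror the steps above, and their exact nesting inside the config file is an assumption that may differ from the actual file.

    # Sanity-check that the downloaded checkpoints exist where config.yaml says they are.
    # The key names mirror the steps above; their exact nesting in configs/config.yaml
    # is an assumption and may differ from the actual file.
    import os
    import yaml

    with open("configs/config.yaml") as f:
        cfg = yaml.safe_load(f)

    def find_key(node, key):
        """Recursively search a (possibly nested) config dict for a key."""
        if isinstance(node, dict):
            if key in node:
                return node[key]
            for value in node.values():
                found = find_key(value, key)
                if found is not None:
                    return found
        return None

    for key in ("whisper_path", "beats_path", "llama_path"):
        path = find_key(cfg, key)
        status = "OK" if path and os.path.exists(path) else "MISSING"
        print(f"{key}: {path} [{status}]")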

🌈 How to run inference in the CLI

  1. Follow steps 1-4 of How to train a model.
  2. Download SALMONN v1 to ckpt.
  3. Run python3 cli_inference.py --cfg-path configs/decode_config.yaml on an A100-SXM-80GB GPU. You can then enter a wav_path and a prompt interactively (a sketch of the prompt wrapping follows this list). Enjoy yourself!
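
For reference, the text prompt is wrapped around an audio placeholder before it reaches the model. The sketch below shows the general shape; the template string and helper function are illustrative assumptions, not the exact code in cli_inference.py.

    # Sketch of how a user prompt is wrapped around the audio placeholder.
    # The template string and helper name are assumptions for illustration.
    def build_prompt(prompt_template: str, user_prompt: str) -> str:
        # The augmented audio tokens are later substituted where <SpeechHere> appears.
        return prompt_template.format("<Speech><SpeechHere></Speech> " + user_prompt.strip())

    template = "USER: {}\nASSISTANT:"  # assumed Vicuna-style template
    print(build_prompt(template, "Please describe the audio."))
    # USER: <Speech><SpeechHere></Speech> Please describe the audio.
    # ASSISTANT: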

🌈 How to launch a web demo

  1. Follow steps 1-4 of How to train a model.
  2. Download SALMONN v1 to ckpt.
  3. Run python3 web_demo.py --cfg-path configs/decode_config.yaml on an A100-SXM-80GB GPU (a minimal wrapper sketch follows this list).
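
If you want a rough idea of what such a demo looks like, here is a minimal Gradio sketch. It is not the repository's web_demo.py; the answer function is a hypothetical stand-in for the real inference call.

    # Minimal Gradio sketch of a web demo (not the actual web_demo.py).
    import gradio as gr

    def answer(wav_path, prompt):
        # Hypothetical stand-in: call the SALMONN inference routine here with the
        # uploaded audio file and the text prompt, then return the generated text.
        return f"(model response for {wav_path!r} with prompt {prompt!r})"

    demo = gr.Interface(
        fn=answer,
        inputs=[gr.Audio(type="filepath"), gr.Textbox(label="Prompt")],
        outputs=gr.Textbox(label="Response"),
    )
    demo.launch()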

👀 Team

Team Tsinghua: Wenyi Yu, Changli Tang, Guangzhi Sun, Chao Zhang

Team ByteDance: Xianzhao Chen, Wei Li, Tian Tan, Lu Lu, Zejun Ma

✨ Citation

If you find SALMONN useful, please cite our paper:

@inproceedings{
  tang2024salmonn,
  title={{SALMONN}: Towards Generic Hearing Abilities for Large Language Models},
  author={Changli Tang and Wenyi Yu and Guangzhi Sun and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun MA and Chao Zhang},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=14rn7HpKVk}
}

salmonn's People

Contributors

tcl606, yu-doit


salmonn's Issues

Adding a Contributors section to the README.md!

Why a Contributors section: a "Contributors" section in a repo gives credit to and acknowledges
the people who have helped with the project, fosters a sense of community, and helps others
know who to contact for questions or issues related to the project.

Issue type

  • [✅] Docs

@TCL606 kindly assign this issue to me! I would love to work on it! Thank you!

Adding a Code of Conduct to the repo!

Code of Conduct: we propose adding a comprehensive Code of Conduct to our repository to ensure
a safe, respectful, and inclusive environment for all contributors and users. This code will
serve as a guideline for behavior, promoting diversity, reducing conflicts, and attracting a
wider range of perspectives.

Issue type

  • [✅] Docs

@ kindly assign this issue to me! I would love to work on it!

Any more information about the audio-text aligner?

Is there any more information about how the audio-text aligner is implemented? Since audio and text sequences have different lengths, it is hard to imagine how they could be trained into the same embedding space.

Thanks.

Typo error in README.md

In the "How to inference in CLI" section, there is a typo in the word "requried." It should be "required." Here's the corrected sentence:

Original: "Our environment: The python version is 3.9.17, and other requried packages can be installed with the following command..."

Corrected: "Our environment: The python version is 3.9.17, and other required packages can be installed with the following command..."

the instruction tuning stage

I'm a little confused about the paper.

  1. Task over-fitting is considered to be caused by the instruction tuning stage. Why can't we use the model directly for zero-shot tasks after the pre-training stage? In other words, what benefit does the instruction tuning stage bring to zero-shot tasks?
  2. In Figure 3, the accuracy and F1 score on SQQA are basically the same for lora scaling=0 and lora scaling=2. Does this phenomenon show that the Q-Former's cross-modal ability from the first stage can already solve this task?

Sometimes the 7B model fails to generate an audio caption

With the 7B model, an audio caption is sometimes not generated when using the following prompt:

    prompt = 'Please describe the audio.'
    prompt = [
        cfg.config.model.prompt_template.format("<Speech><SpeechHere></Speech> " + prompt.strip())
    ]

If the prompt is instead

    prompt = 'Please write down what you hear in the audio.'

then no caption is generated at all.

The role of the prompt_pattern parameter

I noticed that there is a prompt_pattern parameter in your code. For music, do I need to modify it? Can you briefly describe the training process of this model and the dataset used?

Some questions about this project, hoping for your further answers

Thank you very much for open-sourcing the complete code of SALMONN.
Although the presentation in both the paper and the code is clear, a few points still puzzle me after reading them:

  1. What is the difference between the training settings of stage 1 and stage 2 on the AST task?
    The paper says you use LibriSpeech-960h (280k) and the GigaSpeech M-set in stage 1 and then use the same LibriSpeech-960h (also 280k) in stage 2, so what changes in the training setting on the LibriSpeech dataset from stage 1 to stage 2? Did you train without any instructions during stage 1, or did you just change the instructions used?

  2. How were the 200k GigaSpeech samples used in stage 2 obtained?
    I notice that the GigaSpeech subset used in stage 2 has 200k samples, close to the size of the GigaSpeech S-set (220k), while according to the paper it seems the whole GigaSpeech M-set (680k) was used in stage 1. So what exactly are the 200k GigaSpeech samples in stage 2? Were they randomly selected from the GigaSpeech M-set?

  3. Is performance on downstream tasks affected by having so many preset instructions for instruction tuning?
    According to the recently released code, there are many instructions for a single downstream task (for instance, 15 instructions for the ASR task). From my point of view, one problem that is hard to avoid is that instructions for different downstream tasks may share similar patterns, and these similarities could mislead the model into an unexpected task during inference, especially with a low beam setting. So I would like to know whether, in your opinion or experiments, more instructions or fewer instructions are better for tuning, because I am uncertain which case better prevents this kind of similarity.

I could not find information about these problems in either the paper or the code, so I am looking forward to your further answers.
Thank you again for taking the time to read my issue. Hoping for your early reply!

Is the paper published?

Great work! I wonder whether the paper has been published. If so, could you please provide a link to it? Thank you very much!

Have you compared with Qwen-Audio on the AAC task?

Qwen-Audio uses Clotho as its AAC test set, while you use AudioCaps, so it is unclear which model is better on the same test set. Figure 1 of the Qwen-Audio paper claims it is better than SALMONN, but I am not sure whether that comparison is objective.

Which Vicuna version does SALMONN 7B use?

Hello!
SALMONN is excellent work, but I have a small question while using it: which version of Vicuna does SALMONN 7B use? Is it Vicuna v1.1, the same as SALMONN 13B?

Support for recognizing Chinese phone-call recordings seems poor?

The recognition output does not stop; it keeps repeating the same sentence:

This is a phone conversation between two people.

The first person says: "Hello, is that you?"

The second person replies: "Hello, what do you need?"

The first person says: "I would like to ask how much your price is."

The second person replies: "Our price is three hundred and sixty dollars."

The first person says: "Ah, that is too expensive. So how much is the price?"

The second person replies: "Our price is three hundred and sixty dollars."

(The last two turns repeat over and over.)
