
MS-SNSD's Issues

Different audio lengths cause a broadcasting error

If the passed arguments (clean and noise) have different lengths, the mix fails with the following error:
ValueError: operands could not be broadcast together with shapes

This can be solved by tiling the shorter signal until its length matches the longer one, like this:

import numpy as np

def snr_mixer(clean, noise, snr):
    clean_len = len(clean)
    noise_len = len(noise)
    # Tile the shorter signal (plus a partial copy) so both end up the same length
    if clean_len < noise_len:
        rep_time = int(np.floor(noise_len / clean_len))
        left_len = noise_len - clean_len * rep_time
        clean = np.hstack((np.tile(clean, rep_time), clean[:left_len]))
        noise = np.array(noise)
    else:
        rep_time = int(np.floor(clean_len / noise_len))
        left_len = clean_len - noise_len * rep_time
        noise = np.hstack((np.tile(noise, rep_time), noise[:left_len]))
        clean = np.array(clean)

    # Normalize both signals to -25 dBFS
    rmsclean = (clean**2).mean()**0.5
    scalarclean = 10 ** (-25 / 20) / rmsclean
    clean = clean * scalarclean
    rmsclean = (clean**2).mean()**0.5

    rmsnoise = (noise**2).mean()**0.5
    scalarnoise = 10 ** (-25 / 20) / rmsnoise
    noise = noise * scalarnoise
    rmsnoise = (noise**2).mean()**0.5

    # Scale the noise to reach the requested SNR, then mix
    noisescalar = np.sqrt(rmsclean / (10**(snr/20)) / rmsnoise)
    noisenewlevel = noise * noisescalar
    noisyspeech = clean + noisenewlevel
    return clean, noisenewlevel, noisyspeech

clean, noisenewlevel, noisyspeech = snr_mixer(audio_org, noise_org, 2)

'noisescalar' derivation in clean speech and noise mix

Hi,

Thanks for sharing this open-source dataset. I am using this code to generate synthetic noisy datasets for speech processing. In practice, I observed that the generated data has only half the nominal SNR, which I verified in Audacity. After further inspection of 'audiolib.py', I think the 'noisescalar' derivation (line 68) is incorrect.

In 'audiolib.py', the original line is:
noisescalar = np.sqrt(rmsclean / (10**(snr/20)) / rmsnoise)

I think the square root should not be applied here, since the SNR is already expressed in terms of RMS values in the derivation. The noise scaling should instead be:
noisescalar = rmsclean / (10**(snr/20)) / rmsnoise

In my tests, the synthetic noisy data has the correct SNR level after this correction. Could you please fix it in the code?
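
To illustrate the point, here is a minimal check of my own (not code from the repository). When the two signals have similar RMS, the square-root version lands at roughly half the target SNR, while the corrected formula hits the target:

import numpy as np

def realized_snr_db(clean, scaled_noise):
    # Measure the SNR (in dB) actually present in a mix
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    return 20 * np.log10(rms(clean) / rms(scaled_noise))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
target_snr = 10  # dB

rmsclean = np.sqrt(np.mean(clean ** 2))
rmsnoise = np.sqrt(np.mean(noise ** 2))

scalar_sqrt = np.sqrt(rmsclean / (10 ** (target_snr / 20)) / rmsnoise)  # original
scalar_fixed = rmsclean / (10 ** (target_snr / 20)) / rmsnoise          # proposed fix

print(realized_snr_db(clean, noise * scalar_sqrt))   # ~5 dB, about half the target
print(realized_snr_db(clean, noise * scalar_fixed))  # ~10 dB, matches target_snr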

Same configuration as the reference paper

Hello,
We are looking for the configuration that was used for the 'A scalable noisy speech dataset and online subjective test framework' paper. We could not find all of the values, such as SNR levels, noise types, etc.
Maybe add it to the repo so that it's really easy to reproduce the same setup.
Thank you in advance!

Real time noise suppression

Excellent article on VentureBeat today:
https://venturebeat.com/2020/04/09/microsoft-teams-ai-machine-learning-real-time-noise-suppression-typing/

Funnily enough, I've used this dataset (which I assume is the one referred to in the article) to train noise suppression as well. I didn't have a requirement for real-time/streaming, so I used a bidirectional LSTM recurrent layer. I also trained against LibriSpeech (technically LibriTTS, as I wanted 24 kHz audio).

Examples

Sourced from national news broadcasts to show performance on data the model was NOT trained on. The audio files are compressed, as GitHub doesn't allow raw waveform uploads. I've provided the source files from the broadcast with the _noisy.wav suffix and the network's predicted output with the _clean.wav suffix.

Example 1

sequence 1585584_clean
sequence.1585584_.zip

Example 2

sequence 1597540_clean
sequence.1597540_.zip

Example 3

sequence 1046182_clean
sequence.1046182_.zip

Example 4

sequence 1597377_clean
sequence.1597377_.zip

Example 5

sequence 231_clean
sequence.231_.zip

Example 6

Not the best result, but it still did a decent job suppressing a noise sample it was never trained against.
00049 unknown and_despite_that_and_despite_40_million_18_trump_haters_including_people_that_worked_for_hillary_clinton_and_some_of_the_worst_human_beings_on_earth_they_got_nothing_clean
trump_helicopter.zip

Masking-based methods

Hello, if I want to use a mask-based approach for speech enhancement (e.g. IBM, IRM, etc.), how should I use this dataset?
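
For what it's worth, here is a rough sketch (my own, not part of the repository) of turning a matching clean/noise pair produced by the synthesizer into an ideal ratio mask training target. The file paths are hypothetical placeholders, and librosa is assumed for the STFT:

import numpy as np
import librosa

# Hypothetical paths; the synthesizer writes matching clean, noise and noisy files
clean, sr = librosa.load("CleanSpeech_training/clean_example.wav", sr=16000)
noise, _ = librosa.load("Noise_training/noise_example.wav", sr=16000)

n_fft, hop = 512, 128
S_clean = np.abs(librosa.stft(clean, n_fft=n_fft, hop_length=hop))
S_noise = np.abs(librosa.stft(noise, n_fft=n_fft, hop_length=hop))

# One common IRM definition: sqrt of the clean-to-total power ratio per T-F bin
irm = np.sqrt(S_clean ** 2 / (S_clean ** 2 + S_noise ** 2 + 1e-8))

# A model would then be trained to predict irm from features of the noisy mixture,
# and the predicted mask applied to the noisy STFT at inference time.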

Maybe something wrong in audiolib.py

For the function def snr_mixer(clean, noise, snr) in the audiolib.py file, I think there is something wrong.
First, line 66: the np.sqrt() call may be unnecessary.
Second, since clean and noise have already been normalized to -25 dBFS, noisescalar may not need to be calculated from rmsclean and rmsnoise.

Silence Removal Idea

Maybe a silence-removal option could be added, to make it possible to develop robust voice activity detection models. pyAudioAnalysis could be integrated for this purpose.
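
As a rough illustration of the idea (a sketch of my own, not an existing option in the repo), a simple frame-energy threshold can strip silent regions before mixing; pyAudioAnalysis's segmentation utilities could replace this with something more robust:

import numpy as np

def remove_silence(audio, frame_len=512, threshold_db=-40.0):
    # Drop frames whose RMS energy falls below threshold_db relative to full scale
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    keep = 20 * np.log10(rms + 1e-10) > threshold_db
    return frames[keep].reshape(-1)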

some bugs in audiolib

I think there is a bug in the audiolib file, line 66: noisescalar = np.sqrt(rmsclean / (10 ** (snr / 20)) / rmsnoise).
snr should be divided by 10, not 20.
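
For context, the 10-versus-20 question usually comes down to which ratio the dB value is taken over: 10*log10 of a power ratio and 20*log10 of an amplitude (RMS) ratio give the same number, as a quick check of my own shows:

import numpy as np

rms_clean, rms_noise = 0.05, 0.02
snr_from_power = 10 * np.log10(rms_clean ** 2 / rms_noise ** 2)  # power ratio
snr_from_rms = 20 * np.log10(rms_clean / rms_noise)              # amplitude (RMS) ratio
assert np.isclose(snr_from_power, snr_from_rms)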

noisyspeech_synthesizer.py fails to run with numpy 1.18.5

There seems to be a breaking issue when running noisyspeech_synthesizer.py on Google Colab.
Colab has numpy 1.18.5 at the time this issue was posted; this version is installed by default when connecting to a runtime.
Following the standard procedure and prerequisites for running this script produces the following error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
    num = operator.index(num)
TypeError: 'float' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./MS-SNSD/noisyspeech_synthesizer.py", line 124, in <module>
    main(cfg._sections[args.cfg_str])
  File "./MS-SNSD/noisyspeech_synthesizer.py", line 47, in main
    SNR = np.linspace(snr_lower, snr_upper, total_snrlevels)
  File "<__array_function__ internals>", line 6, in linspace
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
    .format(type(num)))
TypeError: object of type <class 'float'> cannot be safely interpreted as an integer.

However, when the numpy version is downgraded from 1.18.5 to 1.16.4, the script works perfectly fine as it is supposed to.
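
The traceback points at the np.linspace call on line 47, and newer numpy releases no longer accept a float for the num argument. Assuming total_snrlevels is parsed from the config as a float, a likely fix (a sketch, not a committed patch) is an explicit cast:

# noisyspeech_synthesizer.py, line 47
SNR = np.linspace(snr_lower, snr_upper, int(total_snrlevels))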

noisyspeech_synthesizer.py always slices from the start of the noise array

In noisyspeech_synthesizer.py, an array of audio samples is read from a noise file (line 78). On line 81, a slice of the noise array is taken from index 0 to len(clean) as:

noise = noise[0:len(clean)]

Because the slice always starts at index 0, when the clean speech arrays are all roughly the same length (~16000 samples, as in the speech commands case), the number of unique noise arrays we ever see equals the number of noise files.

Even if we have one noise file with 10 hours of audio, we may only ever make use of the first 1 second of this data.

It would be better to pick a random starting index within the noise array from which to take the slice. For example:

# choose a random offset so different parts of long noise files get used
start_idx = np.random.randint(low=0, high=len(noise) - len(clean))
noise = noise[start_idx : start_idx + len(clean)]
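
One caveat with this sketch: np.random.randint requires high > low, so the case where the noise clip is not longer than the clean clip needs a guard (the tiling fix from the broadcasting issue above covers making it long enough):

if len(noise) > len(clean):
    start_idx = np.random.randint(low=0, high=len(noise) - len(clean))
else:
    start_idx = 0  # noise is not longer than clean; fall back to the start (or tile first)
noise = noise[start_idx : start_idx + len(clean)]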
