xiph / rnnoise
Recurrent neural network for audio noise reduction
License: BSD 3-Clause "New" or "Revised" License
RNNoise is a noise suppression library based on a recurrent neural network.
A description of the algorithm is provided in the following paper:

J.-M. Valin, A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement, Proceedings of IEEE Multimedia Signal Processing (MMSP) Workshop, arXiv:1709.08243, 2018.
https://arxiv.org/pdf/1709.08243.pdf

An interactive demo is available at: https://jmvalin.ca/demo/rnnoise/

To compile, just type:
% ./autogen.sh
% ./configure
% make

Optionally:
% make install

It is recommended to either set -march= in the CFLAGS to an architecture with AVX2 support or to add --enable-x86-rtcd to the configure script so that AVX2 (or SSE4.1) can at least be used as an option.
Note that the autogen.sh script will automatically download the model files from the Xiph.Org servers, since those are too large to put in Git.

While it is meant to be used as a library, a simple command-line tool is provided as an example. It operates on RAW 16-bit (machine endian) mono PCM files sampled at 48 kHz. It can be used as:
% ./examples/rnnoise_demo <noisy speech> <output denoised>

The output is also a 16-bit raw PCM file.
NOTE AGAIN, THE INPUT and OUTPUT ARE IN RAW FORMAT, NOT WAV.

The latest version of the source is available from https://gitlab.xiph.org/xiph/rnnoise . The GitHub repository is a convenience copy.

== Training ==

The models distributed with RNNoise are now trained using only the publicly available datasets listed below and using the training procedure described here. Exact results will still depend on the exact mix of data used, on how long the training is performed and on the various random seeds involved.

To train an RNNoise model, you need both clean speech data and noise data. Both need to be sampled at 48 kHz, in 16-bit PCM format (machine endian).
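Since both the demo tool and the training tools expect headerless 16-bit PCM rather than WAV, the header has to be stripped first. The following is a minimal sketch using only the Python standard library; the function name wav_to_raw is made up for illustration, and it assumes the WAV file is already 16-bit mono at 48 kHz (it does not resample or downmix).

```python
import array
import sys
import wave


def wav_to_raw(wav_path, raw_path):
    """Write the samples of a 16-bit mono 48 kHz WAV file as headerless PCM.

    Returns the number of samples written. Hypothetical helper, not part of RNNoise.
    """
    with wave.open(wav_path, "rb") as wav:
        if (wav.getnchannels(), wav.getsampwidth(), wav.getframerate()) != (1, 2, 48000):
            raise ValueError("expected 16-bit mono audio sampled at 48 kHz")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")
    samples.frombytes(frames)
    # WAV data is little endian; the raw file must be in machine endianness.
    if sys.byteorder == "big":
        samples.byteswap()

    with open(raw_path, "wb") as raw:
        samples.tofile(raw)
    return len(samples)
```

A file that is not 16-bit mono at 48 kHz is rejected rather than silently converted, because a wrong rate or width would still produce a file that the tools accept but process incorrectly.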
Clean speech data can be obtained from the datasets listed in the datasets.txt file, or by downloading the already-concatenated version of those files at https://media.xiph.org/rnnoise/data/tts_speech_48k.sw
For noise data, we suggest concatenating the 48 kHz noise data from DEMAND at https://zenodo.org/records/1227121 with the contrib_noise.sw and synthetic_noise.sw noise files from https://media.xiph.org/rnnoise/data/
To balance out the data, we recommend using multiple (e.g. 5) copies of the contrib_noise.sw and synthetic_noise.sw noise files.

The first step is to take the speech and noise, and mix them in a variety of ways to simulate real life conditions (including pauses, filtering and more). Assuming the files are called speech.pcm and noise.pcm, start by generating the training feature data with:
% ./dump_features speech.pcm noise.pcm features.f32 <count>
where <count> is the number of sequences to process. The number of sequences should be at least 10000, but the more the better (200000 or more is recommended).

Optionally, training can also simulate reverberation, in which case room impulse responses (RIR) are also needed. Limited RIR data is available at: https://media.xiph.org/rnnoise/data/measured_rirs-v2.tar.gz
The format for those is raw 32-bit floating-point (files are little endian). Assuming a list of all the RIR files is contained in a rir_list.txt file, the training feature data can be generated with:
% ./dump_features -rir_list rir_list.txt speech.pcm noise.pcm features.f32 <count>

To make the feature generation faster, you can use the script provided in script/dump_features_parallel.sh (you will need to modify the script if you want to add RIR augmentation). To use it:
% script/dump_features_parallel.sh ./dump_features speech.pcm noise.pcm features.f32 <count> <nb_processes>
which will run nb_processes processes, each for count sequences, and concatenate the output to a single file.
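The noise concatenation suggested above (DEMAND plus several copies of the contributed and synthetic noise) can be sketched as follows. Because the files are headerless PCM, concatenating them is a plain byte-level append. The helper name concatenate_raw and the file name demand_48k.sw are hypothetical; only contrib_noise.sw and synthetic_noise.sw come from the text.

```python
import shutil


def concatenate_raw(output_path, inputs):
    """Append raw PCM files into one file.

    `inputs` is a list of (path, copies) pairs; each file is written
    `copies` times in a row. Returns the size of the output in bytes.
    """
    with open(output_path, "wb") as out:
        for path, copies in inputs:
            for _ in range(copies):
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        return out.tell()


# Example call (demand_48k.sw is an assumed name for the concatenated DEMAND data):
# concatenate_raw("noise.pcm", [("demand_48k.sw", 1),
#                               ("contrib_noise.sw", 5),
#                               ("synthetic_noise.sw", 5)])
```

This only works because all inputs share the same sample format; files with a WAV header would need to be converted to raw PCM first.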
Once the feature file is computed, you can start the training with:
% python3 train_rnnoise.py features.f32 output_directory

Choose a number of epochs (using --epochs) that leads to about 75000 weight updates. The training will produce .pth files, e.g. rnnoise_50.pth . The next step is to convert the model to C files using:
% python3 dump_rnnoise_weights.py --quantize rnnoise_50.pth rnnoise_c
which will produce the rnnoise_data.c and rnnoise_data.h files in the rnnoise_c directory.

Copy these files to src/ and then build RNNoise using the instructions above.

For slightly better results, a trained model can be used to remove any noise from the "clean" training speech, before restarting the denoising process again (no need to do that more than once).

== Loadable Models ==

The model format has changed since v0.1.1. Models now use a binary "machine endian" format. To output a model in that format, build RNNoise with that model and use the dump_weights_blob executable to output a weights_blob.bin binary file. That file can then be used with the rnnoise_model_from_file() API call. Note that the model object MUST NOT be deleted while the RNNoise state is active and the file MUST NOT be closed.

To avoid including the default model in the build (e.g. to reduce download size) and rely only on model loading, add -DUSE_WEIGHTS_FILE to the CFLAGS.
To be able to load different models, the model size (and header file) needs to match the size used during build. Otherwise the model will not load.

We provide a "little" model with half as an alternative. To use the smaller model, rename rnnoise_data_little.c to rnnoise_data.c. It is possible to build both the regular and little binary weights and load any of them at run time since the little model has the same size as the regular one (except for the increased sparsity).
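As a worked example of the "about 75000 weight updates" guideline in the training section, the number of epochs can be estimated from the number of sequences and the batch size. This is only a sketch: it assumes one weight update per batch and one training example per sequence, and the batch size of 128 is a made-up value (check train_rnnoise.py for the value it really uses).

```python
import math


def epochs_for_updates(num_sequences, batch_size, target_updates=75000):
    """Estimate the --epochs value that gives roughly `target_updates` updates.

    Assumes one weight update per batch; batch_size is supplied by the caller.
    """
    updates_per_epoch = math.ceil(num_sequences / batch_size)
    return max(1, round(target_updates / updates_per_epoch))


# 200000 sequences with an assumed batch size of 128:
# ceil(200000 / 128) = 1563 updates per epoch, 75000 / 1563 is about 48 epochs.
print(epochs_for_updates(200000, 128))
```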
I might be wrong, but there could be a bug in the pitch pseudo-interpolation.
Fortunately it is not critical, but it would be better to fix it if it is an actual bug.
At the end of pitch_search():
if (best_pitch[0]>0 && best_pitch[0]<(max_pitch>>1)-1)
{
   opus_val32 a, b, c;
   a = xcorr[best_pitch[0]-1];
   b = xcorr[best_pitch[0]];
   c = xcorr[best_pitch[0]+1];
   if ((c-a) > MULT16_32_Q15(QCONST16(.7f,15),b-a))
      offset = 1;
   else if ((a-c) > MULT16_32_Q15(QCONST16(.7f,15),b-c))
      offset = -1;
   else
      offset = 0;
} else {
   offset = 0;
}
*pitch = 2*best_pitch[0]-offset;
I think that the last line should be *pitch = 2*best_pitch[0]+offset; (plus instead of minus).
For instance, when c-a > .7*(b-a) is true, it means that c is clearly larger than a, i.e. the correlation peak leans toward xcorr[best_pitch[0]+1]. Hence, I would return 2*best_pitch[0]+1.
Note that at the end of remove_doubling() the same is done, but there *T0_ = 2 * T + offset; is assigned (which has a +). I haven't sent a pull request since this code comes from Opus (and it's used in other projects).
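To make the two candidate behaviours easy to compare, here is a small Python transcription of the decision logic quoted above, using floats instead of the fixed-point macros. The function names and the sign parameter are hypothetical, added only for illustration. Which sign is correct depends on how the xcorr index maps to the pitch period, which this sketch does not settle.

```python
def pseudo_interp_offset(a, b, c):
    """Return +1, -1 or 0 depending on which neighbour the peak b leans toward."""
    if (c - a) > 0.7 * (b - a):
        return 1
    if (a - c) > 0.7 * (b - c):
        return -1
    return 0


def refined_pitch(xcorr, best, max_pitch, sign=-1):
    """Mimic the end of pitch_search(); sign=-1 is the quoted code, +1 the proposal."""
    if 0 < best < (max_pitch >> 1) - 1:
        offset = pseudo_interp_offset(xcorr[best - 1], xcorr[best], xcorr[best + 1])
    else:
        offset = 0
    return 2 * best + sign * offset


xcorr = [0.0, 1.0, 10.0, 9.0, 0.0, 0.0, 0.0, 0.0]
print(refined_pitch(xcorr, 2, 16, sign=-1))  # quoted code
print(refined_pitch(xcorr, 2, 16, sign=+1))  # proposed change
```

With the example values the peak at index 2 leans toward index 3, so the two variants differ by two samples at the full rate (3 versus 5).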
I tried running rnnoise_demo on my data, which is 16-bit mono audio converted from 24-bit mono using ffmpeg. RNNoise successfully produced an output file, but I could not play it. What is wrong?
Could you give me a hint?
Firstly, amazing work with this project!
I noticed that in pitch_filter(X, P, Ex, Ep, Exp, g); the g array is initialized inside the while() loop for every frame. That looks wrong to me.
Hi, first of all thank you for your amazing work!
I've been looking at the code and I am interested in better understanding the feature extraction process.
I have some questions, and I will do my best to be as detailed as possible.
Let me start from compute_frame_features() in denoise.c.
The first step is computing the energy in the Opus bands (done in frame_analysis()) and then the pitch is estimated/tracked. pitch_downsample() downsamples the 20 ms frames by a factor of two, applying a [0.25, 0.5, 0.25] kernel around the even samples. Is this a less expensive way to perform downsampling and low-pass filtering jointly to avoid aliasing?
Then _celt_autocorr() is called with a lag of just 4 samples on the downsampled sequence (hence, 24 kHz). Why just 4? What is achieved exactly with such a low lag? And if I look at _celt_autocorr(), I don't understand why the autocorrelation computed by celt_pitch_xcorr() is modified afterwards by summing the autocorrelation for different lags.
After that, the autocorrelation is further modified once _celt_autocorr() returns (below the // Noise floor -40 dB comment). Why is that done? And finally _celt_lpc() is called and the LPC coefficients are modified (I mean lpc2) and used to filter the downsampled sequence via celt_fir5().
This whole part is a bit obscure to me and it's also hard to understand where some constants come from (e.g., ac[0] *= 1.0001f; and ac[i] -= ac[i]*(.008f*i)*(.008f*i);). I've found some possible mappings between the dB and the linear scale, but I'm not fully sure.
Overall, pitch_downsample() looks like a pre-processing step before the pitch is sought in pitch_search(). It would be great if you could share details on what is done.
My apologies if I am asking something that may be obvious to others. I'm to some extent familiar with LPC and auto-correlation, a little bit with pitch tracking. That's probably why I can't grasp all the details in the code.
Cheers,
Alessio
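For readers following the question above, here is a rough Python sketch of the three steps it mentions: the [0.25, 0.5, 0.25] downsampling, the scaling of ac[0], and the per-lag attenuation. It is a reading of the constants quoted in the question, not a port of pitch_downsample(), and the function names are invented. The dB helper only shows that a factor of 1.0001 corresponds to adding energy 40 dB below ac[0], which matches the "Noise floor -40 dB" comment; interpreting the second formula as a lag window is an assumption.

```python
import math


def downsample_by_two(x):
    """Halve the rate with a [0.25, 0.5, 0.25] kernel centred on the even samples."""
    half = len(x) // 2
    out = [0.0] * half
    for i in range(1, half):
        out[i] = 0.25 * x[2 * i - 1] + 0.5 * x[2 * i] + 0.25 * x[2 * i + 1]
    if half:
        # No sample exists before x[0], so the first output uses two taps only.
        out[0] = 0.5 * x[0] + 0.25 * x[1]
    return out


def condition_autocorr(ac):
    """Apply the two adjustments quoted in the question to a copy of `ac`."""
    ac = list(ac)
    ac[0] *= 1.0001                         # raises lag 0 slightly (noise floor)
    for i in range(1, len(ac)):
        ac[i] -= ac[i] * (0.008 * i) ** 2   # attenuates higher lags more strongly
    return ac


def relative_floor_db(scale=1.0001):
    """Level of the energy added to ac[0], relative to ac[0], in dB."""
    return 10.0 * math.log10(scale - 1.0)


print(downsample_by_two([1.0] * 8))
print(relative_floor_db())
```

The kernel sums to 1, so a constant input passes through unchanged except at the first sample, where one tap is missing.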
Hi, thanks for your nice work. Is it possible to provide a .so library with a .h file? In the code, the file that gets run is a .sh file.
When I try to use the provided Python script dump_rnn.py to decode the newweights9i.hdf5 model, it does not work. I changed quite a lot of it to make it work, but I am not sure my changes are correct, so I am sharing them here. If you have some time, please help me check them. I have used the modified script to decode the model I got. I added the following at the beginning:
from keras import backend as K
from keras.constraints import Constraint

def mean_squared_sqrt_error(y_true, y_pred):
    return K.mean(K.square(K.sqrt(y_pred) - K.sqrt(y_true)), axis=-1)

def my_crossentropy(y_true, y_pred):
    return K.mean(2*K.abs(y_true-0.5) * K.binary_crossentropy(y_pred, y_true), axis=-1)

def mymask(y_true):
    return K.minimum(y_true+1., 1.)

def msse(y_true, y_pred):
    return K.mean(mymask(y_true) * K.square(K.sqrt(y_pred) - K.sqrt(y_true)), axis=-1)

def mycost(y_true, y_pred):
    return K.mean(mymask(y_true) * (10*K.square(K.square(K.sqrt(y_pred) - K.sqrt(y_true))) + K.square(K.sqrt(y_pred) - K.sqrt(y_true)) + 0.01*K.binary_crossentropy(y_pred, y_true)), axis=-1)

def my_accuracy(y_true, y_pred):
    return K.mean(2*K.abs(y_true-0.5) * K.equal(y_true, K.round(y_pred)), axis=-1)

class WeightClip(Constraint):
    # Clips every weight to the range [-c, c].
    def __init__(self, c=2, name='WeightClip'):
        self.c = c

    def __call__(self, p):
        #return {'name': self.__class__.__name__, 'c': self.c}
        return K.clip(p, -self.c, self.c)

    def get_config(self):
        return {'name': self.__class__.__name__, 'c': self.c}

That is, I added a name='WeightClip' argument to __init__, and changed the load_model call from
model = load_model('./newweights9i.h5', custom_objects={'msse': mean_squared_sqrt_error, 'mean_squared_sqrt_error':mean_squared_sqrt_error, 'my_crossentropy':mean_squared_sqrt_error, 'mycost':mean_squared_sqrt_error, 'WeightClip':foo})
to
model = load_model(sys.argv[1], custom_objects={'msse':msse, 'mean_squared_sqrt_error': mean_squared_sqrt_error, 'my_crossentropy':my_crossentropy, 'mycost':mycost, 'WeightClip':WeightClip})
In the rnn_train.py file there are two array assignments whose result is never used:
noise_train = np.copy(all_data[:nb_sequences*window_size, 64:86])
noise_train = np.reshape(noise_train, (nb_sequences, window_size, 22))
What are they? Are they needed somewhere? The training data contained in those arrays is never used at all!
Hi,
Firstly, amazing work with this project! I have verified that it is quite effective at de-noising speech samples out of the box. I have found, however, that feeding certain noisy speech samples through the example program results in de-noised output with the speech signal being clipped. For my use-case, it would be highly favorable to cancel slightly less noise to receive non-clipped output. Is this something that can be easily done, either out-of-box or via modification on my end?
I am also wondering if the clipping is a result of the samples themselves (which were recorded with a laptop mic) rather than your code, per se. I can't yet identify what about my samples causes some to get clipped and others not.
Hi, in the code, the function seems to apply a filter to the speech and the noise. Is there a formula for this, and for the related parameters -1.99599, 0.99600, -2, 1? Thanks.
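One way to read those four numbers is as the coefficients of a second-order IIR (biquad) section, with (-1.99599, 0.99600) as the denominator and (-2, 1) as the numerator. That pairing is an assumption here, as is the transposed direct form II structure used below. Under that reading the transfer function is H(z) = (1 - 2 z^-1 + z^-2) / (1 - 1.99599 z^-1 + 0.996 z^-2), which has a double zero at DC and therefore behaves as a high-pass filter.

```python
def biquad(x, b, a):
    """Filter x with H(z) = (1 + b[0] z^-1 + b[1] z^-2) / (1 + a[0] z^-1 + a[1] z^-2).

    Transposed direct form II with two state variables, starting from zero state.
    """
    mem0 = mem1 = 0.0
    y = []
    for xi in x:
        yi = xi + mem0
        mem0 = mem1 + (b[0] * xi - a[0] * yi)
        mem1 = b[1] * xi - a[1] * yi
        y.append(yi)
    return y


A_HP = (-1.99599, 0.99600)  # assumed denominator coefficients
B_HP = (-2.0, 1.0)          # assumed numerator coefficients

dc = biquad([1.0] * 20000, B_HP, A_HP)                                  # constant input
nyquist = biquad([(-1.0) ** n for n in range(2000)], B_HP, A_HP)        # alternating input
print(abs(dc[-1]), abs(nyquist[-1]))
```

A constant (DC) input decays to essentially zero at the output, while a signal alternating at the Nyquist rate passes with a gain close to 1, which is the expected behaviour of a DC-rejection filter.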
There currently are quite a few magic numbers in the codebase. I’m referring to this type:
Unique values with unexplained meaning or multiple occurrences which could (preferably) be replaced with named constants
Can you tell me how to compile this rnnoise code using the Emscripten toolkit?
Could someone please tell the command to convert WAV to PCM for processing and then back to WAV?
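For the step back from the tool's raw output to a playable file, a WAV header can be added with Python's standard wave module. This is a minimal sketch rather than an official tool; the name raw_to_wav is hypothetical, and it assumes the raw file holds 16-bit mono samples in machine endianness at 48 kHz, as the README states for rnnoise_demo.

```python
import array
import sys
import wave


def raw_to_wav(raw_path, wav_path, sample_rate=48000):
    """Wrap headerless 16-bit mono PCM in a WAV container.

    Returns the number of samples written. Hypothetical helper, not part of RNNoise.
    """
    with open(raw_path, "rb") as raw:
        data = raw.read()

    samples = array.array("h")
    samples.frombytes(data[: len(data) - len(data) % 2])  # drop a trailing odd byte
    # WAV stores little-endian samples; swap if the machine is big endian.
    if sys.byteorder == "big":
        samples.byteswap()

    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(samples.tobytes())
    return len(samples)
```

Players generally cannot guess the rate and sample format of a headerless file, which is a common reason why raw output appears unplayable.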
I used 2 hours of speech and 2 hours of noise to generate a 24-hour dataset for training rnnoise, but the resulting model performs badly. Can you give me some advice? Thanks.
Does this repo have a checkpointed version of a working set of trained weights? I can't seem to find it.
Additionally, I see there is training code in the repo, and you mention the data format for the audio used for training here, but how would I go about training a network with my own audio data?
Thanks!
First of all, awesome stuff! Kudos for open sourcing this gem.
I'm messing around with RNNoise a bit and stumbled upon an interesting behavior. For as long as my Macbook Pro fans are quiet, VAD works perfectly as well as the actual noise suppression. Once my fans turn up and go all the way to ~6200RPM, weird things start to happen:
I understand the temporal dynamic behavior of RNNs, but I wonder - could this be caused by insufficient data in the model (not covering my laptop fan properly), the NN layers structure or rather the code around it?
I'd like to understand better where to start digging. Thanks!
It took me a while to see that FRAME_SIZE currently has to be hard-coded by the library user. There simply is no API for grabbing the internal value. Can we fix this?
Hi. What does the downsampling function in pitch.c do? As the name suggests, does it downsample the pitch buffer? Why is downsampling needed?
Hi, this is a great project. I have tried your pretrained model, and it performed well in real-world scenarios.
I collected about 50 hours of clean speech data and 3 hours of noise data myself. I used your code to generate noisy speech data and to extract input features and output labels. But whether I trained the model from scratch or started from your pretrained model, my trained model performs worse than yours. Is there any trick to constructing the training data?
When listening to the constructed noisy speech, I found that in some places the noise is so strong that nothing else can be heard. Is that because the speech gain or the noise gain was too big in those places? The speech gain was set between 0.01 and 10, and the noise gain between 0.03 and 10. It seemed that data overflow occurred. What do you think?
Looking forward to your reply. Thanks.
Has anyone tried running the code using fixed-point arithmetic?
How were the results? Which Q number format did you use?
Best regards
According to your web site https://people.xiph.org/~jm/demo/rnnoise/ the initial version was in Keras. Is it possible to make this version available? Thanks a lot
It always installs only these doc files:
%%PORTDOCS%%%%DOCSDIR%%/AUTHORS
%%PORTDOCS%%%%DOCSDIR%%/COPYING
%%PORTDOCS%%%%DOCSDIR%%/README
With fixed background music plus speech, it can remove the background noise. I think rnnoise is very suitable for doing such a thing. How do I control the GRU switch? I think it's an exciting thing.
Hello Jean-Marc! This is probably something that you were already planning, but completely removing noise during low-SNR portions of the sound doesn't always sound good. It is easily fixable in my plugin (https://github.com/lucianodato/speech-denoiser) but it is something you should consider for the library.
By the way, I'm having trouble making it work in my plugin. Does the library support 32-bit floats, or do I have to convert them down to 16 bits?
Thanks for the great project!
It has given me a lot of inspiration.
Now I'm trying to convert the code from a 48 kHz sampling rate to a 16 kHz sampling rate.
I changed some parameters as follows.
I also changed the pitch-related factors:
#define PITCH_MIN_PERIOD 60 -> 20
#define PITCH_MAX_PERIOD 768 -> 256
#define PITCH_FRAME_SIZE 960 -> 320
Is this correct?
I tried training the neural network after changing the above parameters,
but it does not seem to work
(the output is just the input with its gain reduced).
Is there anyone who can answer my question?
Thanks.
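The three values listed above are consistent with scaling each 48 kHz constant by the ratio of the sample rates (16000 / 48000 = 1/3). A short check, assuming these constants are sample counts that scale linearly with the rate; the helper name is made up, and this says nothing about other rate-dependent parts of the code or about whether the trained model still fits.

```python
def scale_constant(value_at_48k, target_rate, base_rate=48000):
    """Scale a sample-count constant from base_rate to target_rate.

    Raises ValueError if the result would not be a whole number of samples.
    """
    scaled, remainder = divmod(value_at_48k * target_rate, base_rate)
    if remainder:
        raise ValueError("constant does not scale to an integer")
    return scaled


for name, value in [("PITCH_MIN_PERIOD", 60),
                    ("PITCH_MAX_PERIOD", 768),
                    ("PITCH_FRAME_SIZE", 960)]:
    print(name, value, "->", scale_constant(value, 16000))
```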
Very much appreciate this work! Not a Keras user here, but if I were to use the saved model "newweights9i.hdf5" directly from Python for inference (I know it's slow), do you have any code snippet I can follow along? Thanks!
Only one noise file and one voice file can be used for one training session. So how can I improve the noise reduction by training on multiple kinds of noise?
The following code produces warnings in clang: *(int*)0=0;
Indirection of non-volatile null pointer will be deleted, not trap
It recommends the following:
Consider using __builtin_trap() or qualifying pointer with 'volatile'
Though this question was posted earlier in the discussion threads, I could not find a definitive answer. I would like to understand whether the RNNoise demo can be used with 8 kHz / 16 kHz PCM samples.
Does the training depend on the sample rate?
Can anyone suggest what changes I should make in order to adapt the decoder to process speech at an 8 kHz / 16 kHz sample rate?
When trying to execute denoise_training, the resulting f32 file is empty.
Preprocessing denoise.c (gcc -E -DTRAINING=1 -Wall -W -O3 -g -I../include denoise.c -o denoise_training.E) shows that the following statements have been disabled:
#if 0
compute_rnn(&noisy->rnn, g, &vad_prob, features);
interp_band_gain(gf, g);
#if 1
for (i=0;i<FREQ_SIZE;i++) {
X[i].r *= gf[i];
X[i].i *= gf[i];
}
#endif
frame_synthesis(noisy, xn, X);
for (i=0;i<FRAME_SIZE;i++) tmp[i] = xn[i];
fwrite(tmp, sizeof(short), FRAME_SIZE, fout);
#endif
and nothing will be written into fout (.f32 file).
Is it working as expected? How should the training data be prepared?
Thank you in advance
My training data set is a 50000000 x 87 matrix. In each iteration my model reads the data in the same order. It is recommended to shuffle the training data at the beginning of each epoch so that the model generalizes better. Is it possible to shuffle the training data? Will it somehow make the data invalid?
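One point relevant to this question: rnn_train.py reshapes the rows into sequences of window_size consecutive frames, so shuffling individual rows would scramble the time order that a recurrent network relies on. Shuffling whole sequences keeps each window intact. The sketch below only computes the shuffled row ranges using the standard library; the function name is hypothetical and the window size in the example is made up.

```python
import random


def shuffled_windows(nb_rows, window_size, seed=None):
    """Return (start, stop) row ranges of whole sequences in random order.

    Rows inside a range stay in their original order; leftover rows that do
    not fill a complete window are dropped.
    """
    nb_sequences = nb_rows // window_size
    order = list(range(nb_sequences))
    random.Random(seed).shuffle(order)
    return [(s * window_size, (s + 1) * window_size) for s in order]


# Ten rows with a window of three frames give three complete sequences.
print(shuffled_windows(10, 3, seed=0))
```

Each (start, stop) pair can then be used to slice the feature matrix, so the same data is visited in a different sequence order every epoch.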
Online it says this was written in Python and then converted to C. Is there any way that I could get access to that code?
Hi,
Thanks for the terrific work.
Here is a question about mix gain, in denoise.c.
I notice the denoising effect is too strong, and speech_gain and noise_gain are set as below. Could there be a distortion problem when these two signals are mixed, or when handling real-world audio?
speech_gain = pow(10., (-40+(rand()%60))/20.); // 0.01~8.9
noise_gain = pow(10., (-30+(rand()%50))/20.); // 0.03~8.9
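The ranges of those two expressions follow directly from the arithmetic: rand()%60 yields 0 to 59, so the speech exponent runs from -40 dB to +19 dB, and rand()%50 gives -30 dB to +19 dB for the noise. A small worked check (the function name is invented for illustration):

```python
def gain_range(offset_db, span):
    """Linear gain range of pow(10, (offset_db + rand() % span) / 20)."""
    lowest = 10 ** (offset_db / 20)
    highest = 10 ** ((offset_db + span - 1) / 20)
    return lowest, highest


speech_lo, speech_hi = gain_range(-40, 60)   # about 0.01 to 8.91
noise_lo, noise_hi = gain_range(-30, 50)     # about 0.0316 to 8.91
print(speech_lo, speech_hi, noise_lo, noise_hi)
```

Gains above 1 can push a loud 16-bit input beyond the 16-bit range once speech and noise are summed, which is consistent with the overflow suspected in the question; whether the training code guards against that is not shown here.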
I want to test the effect of rnnoise, but if it only supports a 48 kHz sample rate, it is not a good fit for an embedded MCU.
I have installed the library as per the README, but I am facing the issue below when running
./examples/rnnoise_demo input.pcm output.pcm
/home/navin/Documents/Speech/speech/speech-engine/toolkit/rnnoise/examples/.libs/lt-rnnoise_demo: error while loading shared libraries: librnnoise.so.0: cannot open shared object file: No such file or directory
What package did I miss?
First of all, this is great work.
The code for preparing denoise_data9.h5 and the Python code for denoising using the trained model are missing. Could you please share them too?
I am looking at the training code. Is there a script to generate 'denoise_data9.h5' from the raw audio + noise examples?
Hi, thanks for your work!
I tried to convert my WAV file with this command:
sox
But when I apply the network, I cannot play the output; it seems to contain 0 s of audio.
Can you share a correct command to convert a WAV file to the right format, please?
Thanks
I get Segmentation fault: 11 when running $ ./examples/rnnoise_demo input.pcm output.pcm
How would I test this by streaming the microphone to speakers in realtime?
Hi,
I wonder whether, if we train using noise from only one place, we could get a noise reducer that removes the noise of that place while keeping the noise from other places.
Anyone thought about or tried this?
“A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement”
I want to ask which journal this paper was published in. I can't find the answer on the Internet.
Is it possible to process live audio? Maybe something like VST effect.
I was wondering if it would be possible to integrate rnnoise into something like the XMOS VocalFusion XVF3000. Microphone arrays with this chip are quite readily available and affordable.
At first sight it looks like good fit.
I am not sure
Hi, xiph:
This is not an issue, just a question; sorry to put it here:
1. If I want to train from scratch (using your training data, resampled to 16 kHz), what should I change in the code?
2. Where can I get your training data? I could not find it.
3. How do I get the C code after training with Keras?
Thanks!
Hi, I am trying to make the build for rnnoise, but I keep running into this error:
ld.exe: cannot find -link
I have searched extensively and cannot find a solution to this problem.
Here's the full error message:
$ make
make all-am
make[1]: Entering directory '/cygdrive/c/Users/goldw/Downloads/rnnoise-master/rnnoise-master'
CCLD librnnoise.la
C:/Program Files/mingw-w64/x86_64-7.2.0-posix-seh-rt_v5-rev1/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/7.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -link
collect2.exe: error: ld returned 1 exit status
make[1]: *** [Makefile:460: librnnoise.la] Error 1
make[1]: Leaving directory '/cygdrive/c/Users/goldw/Downloads/rnnoise-master/rnnoise-master'
make: *** [Makefile:354: all] Error 2
dump_rnn.py should be in training/, not just some link.
denoise_training should use fout again instead of stdout or fout should be removed.
#define VAD_GRU_SIZE 24.0.
voice.wav and noise.wav to rnn_data.h and rnn_data.c.
See also: #12
Is there a limit to the size of the training data set that can be used?
How much data is recommended for training?
To train the network, does all the data have to be resampled to 48 kHz, or is it possible to use 16 kHz data?
Thank you in advance
How did you manage to compile with emcc? I'm using it like emcc -v -std=c++11 -s MODULARIZE=1 -s LEGACY_VM_SUPPORT=1 -s WASM=0 rnn.cc, but it does not work properly...
Hi, can we create a Windows .exe file, or is there any code made for Windows?