
nnAudio's Introduction

nnAudio

nnAudio is an audio processing toolbox that uses PyTorch convolutional neural networks as its backend. Spectrograms can therefore be generated from audio on the fly during neural network training, and the Fourier kernels (e.g. CQT kernels) can themselves be trained. Kapre has a similar concept: it also uses a 1D convolutional neural network to extract spectrograms, but is based on Keras.

Other GPU audio processing tools are torchaudio and tf.signal. However, they do not use the neural network approach, and hence their Fourier bases cannot be trained. As of PyTorch 1.6.0, torchaudio is still very difficult to install under Windows due to sox. nnAudio is a more portable audio processing tool across operating systems, since it relies mostly on PyTorch convolutional neural networks. The name nnAudio comes from torch.nn.
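As a minimal sketch (using the same constructor arguments that appear in the issues further down this page), a trainable spectrogram layer can be dropped into a model like this:

import torch
from nnAudio import Spectrogram

# The Fourier kernels become learnable parameters when trainable=True,
# so the optimizer can update the basis during training.
spec_layer = Spectrogram.STFT(n_fft=2048, hop_length=512, sr=44100,
                              trainable=True, output_format="Magnitude")

x = torch.randn(1, 44100)   # (batch, n_samples): one second of dummy audio
spec = spec_layer(x)        # (batch, freq_bins, time_steps), computed on the fly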

Installation

pip install git+https://github.com/KinWaiCheuk/nnAudio.git#subdirectory=Installation

or

pip install nnAudio==0.3.1

Documentation

https://kinwaicheuk.github.io/nnAudio/index.html

Comparison with other libraries

| Feature | nnAudio | torch.stft | kapre | torchaudio | tf.signal | torch-stft | librosa |
|---|---|---|---|---|---|---|---|
| Trainable | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ |
| Differentiable | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Linear frequency STFT | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Logarithmic frequency STFT | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Inverse STFT | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Griffin-Lim | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ |
| Mel | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ |
| MFCC | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ |
| CQT | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| VQT | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Gammatone | ☑️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CFP¹ | ☑️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GPU support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |

✅: Fully supported ☑️: In development (only available in the dev version) ❌: Not supported

¹ Combining Spectral and Temporal Representations for Multipitch Estimation of Polyphonic Music

News & Changelog

To view the full changelog, please go to CHANGELOG.md

version 0.3.1 (24 Dec 2021):

  1. Added VQT feature #113

version 0.3.0 (19 Nov 2021):

  1. Changed module naming. nnAudio.Spectrogram will be replaced by nnAudio.features in future releases. Currently, the various spectrogram types are accessible via both module names.
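In code, both import paths currently resolve to the same classes:

from nnAudio import features        # new, forward-compatible module name
spec_layer = features.STFT(n_fft=2048)

from nnAudio import Spectrogram     # legacy name, kept for compatibility
spec_layer = Spectrogram.STFT(n_fft=2048)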

How to cite nnAudio

The paper for nnAudio is available on IEEE Access:

K. W. Cheuk, H. Anderson, K. Agres and D. Herremans, "nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks," in IEEE Access, vol. 8, pp. 161981-162003, 2020, doi: 10.1109/ACCESS.2020.3019084.

BibTeX

@ARTICLE{9174990,
  author={K. W. {Cheuk} and H. {Anderson} and K. {Agres} and D. {Herremans}},
  journal={IEEE Access},
  title={nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks},
  year={2020},
  volume={8},
  number={},
  pages={161981-162003},
  doi={10.1109/ACCESS.2020.3019084}}

Call for Contributions

nnAudio is a fast-growing package. With the increasing number of feature requests, we welcome anyone who is familiar with digital signal processing and neural networks to contribute to nnAudio. The current list of pending features includes:

  1. Invertible Constant Q Transform (CQT)

(Quick tip for the unit tests: cd into the Installation folder, then run pytest. You need at least 1931 MiB of GPU memory to pass all the unit tests.)

Alternatively, you may also contribute by:

  1. Making better demonstration code or tutorials

Dependencies

Numpy >= 1.14.5

Scipy >= 1.2.0

PyTorch >= 1.6.0 (Griffin-Lim only available after 1.6.0)

Python >= 3.6

librosa = 0.7.0 (nnAudio theoretically depends on librosa, but only for the single function mel from librosa.filters. To save users the trouble of installing librosa for this one function, the corresponding code has been copied into nnAudio, so it runs without librosa installed.)

Other similar libraries

Kapre

torch-stft

nnAudio's People

Contributors

gudgud96, kinwaicheuk, manza12, mcw519, mgrachten, migperfer, sirhans, tan90xx, tasercake, thasthika, turian, wanghelin1997, yoyololicon


nnAudio's Issues

Example code doesn't work

I am trying to run this example, but I get the error:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

scipy 1.2.0
numpy 1.14.5
torch 1.7.1+cu110
Python 3.7.4

Apply to multi-channel signals

I am working on multi-channel sEMG signals.

Is it possible to apply the API to a multi-channel signal of shape (n_channels, n_samples) to produce STFT data of shape (n_channels, n_frequencies, n_frames)?

My current solution is to process each channel separately and then combine the results, but I wonder if there is a way to process all channels in one pass (see the sketch below).

Thank you.
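For what it's worth, a hedged sketch of a workaround: since the transforms map (batch, n_samples) to (batch, freq_bins, time_steps), the channel axis can simply be treated as the batch axis (the parameter values below are placeholders):

import torch
from nnAudio import Spectrogram

stft = Spectrogram.STFT(n_fft=256, hop_length=128, sr=2000,
                        output_format="Magnitude")

x = torch.randn(8, 4000)   # (n_channels, n_samples)
spec = stft(x)             # (n_channels, freq_bins, n_frames) in a single pass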

CQT and Log Magnitude

Hi,

Thank you for this library; it has been really straightforward to use while migrating a PyTorch project from Mel spectrograms to CQT.

Unfortunately, my knowledge of spectrogram representations is essentially non-existent. I'd like to know whether the spectrogram returned by the CQT1992v2 class is a 'log magnitude' spectrogram when output_format is Magnitude.

I see the terms magnitude and log magnitude used frequently in research papers, so I'm not sure whether they are being used interchangeably or whether there is a difference.

Thanks

setting GPU through "torch.set_default_tensor_type" and other details

Hi! I have installed your library and started running some trainings.
I have a couple of remarks from first use.

You point to torch.set_default_tensor_type as the way to run on the GPU. It does work, but it can cause issues: for instance, in my case I cannot use set_default_tensor_type together with multiple workers (it took me a while to figure that out): https://discuss.pytorch.org/t/cuda-initialization-error-when-dataloader-with-cuda-tensor/43390/3
I guess it would be better to have a way to set the device of the spectral operator itself, such as an init argument or a .to(device) method (see the sketch below).
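A sketch of the requested pattern, assuming a version in which the kernels are registered on the module so that the standard .to(device) moves them:

import torch
from nnAudio import Spectrogram

spec_layer = Spectrogram.STFT(n_fft=1024, hop_length=512, sr=22050)
spec_layer = spec_layer.to("cuda:0")        # move the kernels explicitly

x = torch.randn(4, 22050, device="cuda:0")
spec = spec_layer(x)                        # no set_default_tensor_type needed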

MelSpectrogram doesn't seem to have the arguments (trainable_mel=, trainable_STFT=) that you mention in the readme. And it is sometimes a bit confusing to guess whether freq_bins, fmin, and fmax can be left as None for the default behaviour.

Of course, it's early in development, so it's great to have this available already; I just point these out in case it helps. If you are interested, I can update you on whether it optimizes as well as my current approach (torch.stft with librosa filters moved to the GPU), and whether the other features improve the fit of my models (I am particularly looking at the log-scale STFT).

Best

MelSpectrogram does not return magnitude

Hi, I went through the source code of your project and noticed a difference between what MelSpectrogram and STFT return.
STFT returns the magnitude, i.e. sqrt(Re² + Im²), but MelSpectrogram returns only the power, Re² + Im². Is there any reason behind this?
Would it be possible to specify 'power' as an argument to MelSpectrogram, similarly to librosa?
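Until such an argument exists, a hedged workaround sketch: since power = magnitude², a square root of the output recovers a magnitude-scaled mel spectrogram.

import torch
from nnAudio import Spectrogram

mel_layer = Spectrogram.MelSpectrogram(sr=22050, n_fft=2048, n_mels=128)
x = torch.randn(1, 22050)

mel_power = mel_layer(x)               # mel-weighted power, Re^2 + Im^2
mel_magnitude = torch.sqrt(mel_power)
# Note: sqrt is applied AFTER mel weighting, so this is not identical to
# mel-weighting the magnitude STFT (librosa's power=1.0); it is only an
# approximate workaround until a power argument exists.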

missing __version__

Thank you for the great package

It would be good if you could define a global variable __version__ in the top-level module nnAudio. This way, we could inspect the version of nnAudio we're currently running directly from inside Python.

Thanks!

Trainable?

Hello,

I am new to the audio domain and I am a little confused about the exact meaning of 'Trainable' in nnAudio.

The comparison matrix says torchaudio is not trainable but nnAudio is.

What is the difference between 'trainable' in torchaudio and nnAudio?

Thank you:)
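Concretely, 'trainable' means the Fourier kernels are registered as learnable parameters, so they show up in the module's parameters and receive gradients. A hedged sketch (the exact parameter names vary by version):

import torch
from nnAudio import Spectrogram

spec = Spectrogram.STFT(n_fft=512, trainable=True)
print([(name, p.shape) for name, p in spec.named_parameters()])  # the kernels

loss = spec(torch.randn(1, 4096)).sum()
loss.backward()   # gradients now flow into the kernels themselves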

STFT Phase shape (1, ..., ..., 2)

Hi,
Thanks for the library!
I was wondering why the shape of the phase output is (1, ..., ..., 2).
Shouldn't taking the phase, just like taking the magnitude, reduce the dimensionality?

From the documentation:
Phase will return the phase of the STFT result, shape = (num_samples, freq_bins, time_steps, 2).
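One plausible reading (hedged, not confirmed by the docs quoted above): the trailing dimension stores the phase as a (cos θ, sin θ) pair rather than a single wrapped angle, which keeps the output smooth and differentiable. Under that assumption, a single-angle phase can be recovered like this (output_format may be a constructor argument instead, depending on the version):

import torch
from nnAudio import Spectrogram

stft = Spectrogram.STFT(n_fft=512)
phase = stft(torch.randn(1, 4096), output_format="Phase")  # (..., time_steps, 2)

# Assuming the trailing pair is (cos(theta), sin(theta)):
theta = torch.atan2(phase[..., 1], phase[..., 0])          # one angle per bin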

Spectrograms not updating well at low frequency bins

Hello, thanks for putting together a really useful library!

I'm working on a pneumonia detection problem. My dataset is heavily imbalanced, with 2000+ non-pneumonia cases and 142 pneumonia cases, so I decided to stick with 142 cases of each label to keep the dataset balanced.

I am trying to apply the STFT layer in the following model:

[model architecture diagram]

with the following parameters:

self.spec_layer = Spectrogram.STFT(n_fft=256, hop_length=128, sr=8000, trainable=True, output_format="Magnitude")

Now, I'm observing some modification of the spectrograms as training proceeds, but the trained spectrogram mainly gets updated at the higher frequency bins. It should be the low-frequency bins that inform the network's decisions, since lung sounds lie in the 0-4000 Hz range and I sample at 8000 Hz. Here is a spectrogram of a pneumonia sample before training:

[spectrogram before training]

and here are its updated versions at epochs 10, 50, and 140, respectively:

[trained spectrogram at epoch 10]

[trained spectrogram at epoch 50]

[trained spectrogram at epoch 140]

Since it's really hard to visualize, I generated difference maps (= trained spectrogram at a given epoch minus the original untrained spectrogram). Here are the difference maps at epochs 10, 50, and 140, respectively:

[difference map at epoch 10]

[difference map at epoch 50]

[difference map at epoch 140]

It's difficult to see, but there are some slight modifications in the lower frequency bins 0-24, only a little, and barely any for bins 0-12.

Some of the training parameters are

parameters.lr = 1e-4
parameters.n_epochs = 150
parameters.batch_size = 32
parameters.audio_length = 5

I use nnAudio == 0.2.6.

What about the results of this work on speech tasks?

I wonder whether this work helps in speech-related tasks, such as speech enhancement or ASR.
Has anyone tried it on these tasks, and what were the results?
Looking forward to any reply.

General improvements

Hi! I use this package a lot and I think it can be improved for further contributions:

  1. All transformations are inside Spectrogram.py, but as the number of transformations grows, I think this becomes a little unsustainable. Could we establish some categories (STFT, CQT, Mel-related, etc.) so it is easier to maintain? We could then import the models from Spectrogram.py.
  2. Spectrogram.py is not a very standard Python module name, as it starts with an uppercase letter. Would changing it to lowercase make sense? It would break imports for packages relying on the current version, but it looks more "pythonic" to me. Maybe we can keep both and raise a deprecation warning when importing with the uppercase name (see the sketch below).
  3. Tests are likewise in a single file. Would it be possible to split them into multiple files?

I want to work on these changes, but it would be nice to reach an agreement on how to structure them before opening a PR 😄
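For point 2, a hedged sketch of the compatibility-shim idea, assuming the classes move to a lowercase module (here called nnAudio.features, matching the later 0.3.0 rename):

# Spectrogram.py -- hypothetical backward-compatibility shim
import warnings

warnings.warn(
    "nnAudio.Spectrogram is deprecated; import from nnAudio.features instead.",
    DeprecationWarning,
    stacklevel=2,
)

from nnAudio.features import *  # re-export everything from the new location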

Trainable kernels CQT2010v2

Hi! First of all thanks for this amazing tool, it's saving me a lot of time!
When I try to make CQT2010v2 trainable, I get the following error:

Exception has occurred: NameError
name 'trainable_kernels' is not defined

The trainable_kernels variable is referenced here but never defined. The documentation states that this transform can be made differentiable as well.

The line of code I'm using is spec.CQT2010v2(sr=sr, n_bins=128, bins_per_octave=16, hop_length=1024, pad_mode='constant', trainable=True)

[Feature Request] Allow STFT kernels to be normalized

I think it would be nice to have normalization tools for the STFT kernels (they exist for CQT in the forward pass via the normalization_type parameter) in order to control the norm of the output.

If you want I can do a PR.

win_length option not working for MelSpectrogram

I'm testing the win_length option in versions 0.1.2.dev3 and 0.1.4a0. In both versions, I get an error when instantiating MelSpectrogram with this option.

>>> mel = Spectrogram.MelSpectrogram(sr=16000, n_fft=512, device='cpu')
STFT filter created, time used = 0.0072 seconds
Mel filter created, time used = 0.0073 seconds
>>> mel = Spectrogram.MelSpectrogram(sr=16000, n_fft=512, win_length=400, device='cpu')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'win_length'

$ pip list | grep nnAudio
nnAudio 0.1.2.dev3
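For reference, a hedged sketch of what win_length implements internally in librosa (and in later nnAudio versions): build the shorter window and centre-pad it with zeros to n_fft. Whether a precomputed window can be fed to that nnAudio version is not guaranteed.

import numpy as np
from scipy.signal import get_window

n_fft, win_length = 512, 400

win = get_window("hann", win_length, fftbins=True)        # the short window
lpad = (n_fft - win_length) // 2
padded = np.pad(win, (lpad, n_fft - win_length - lpad))   # length n_fft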

CQT2010 problematic output

It seems I messed up something when updating nnAudio from 0.1.15 to 0.2.0.
The output of CQT2010 is very different from CQT2010v2. I suspect something goes wrong during downsampling, but I don't have time to debug at the moment, so I am posting this issue to remind myself later. Or, if anyone knows the solution to this problem, a pull request is welcome.

Unknown

The following code reproduces the above-mentioned issue, using nnAudio 0.2.2:

import numpy as np
import torch
import librosa                    # needed for the librosa.cqt comparison below
import matplotlib.pyplot as plt   # needed for the plots at the end
from scipy.signal import chirp
from nnAudio import Spectrogram

# Linear sweep case
fs = 44100
t = 1
f0 = 55
f1 = 22050
s = np.linspace(0, t, fs*t)
x = chirp(s, f0, 1, f1, method='linear')
x = x.astype(dtype=np.float32)
device='cpu'

n_bins = 100
bins_per_octave = 12
window = 'hann'
filter_scale = 2
normalization_type = 'wrap'

# CQT2010v2 case
stft2 = Spectrogram.CQT2010v2(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window)
X2 = stft2(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X2 = torch.log(X2 + 1e-2)

#     np.save("tests/ground-truths/linear-sweep-cqt-2010-mag-ground-truth", X.cpu()) 


X3 = librosa.cqt(x, sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window)
X3 = np.log(abs(X3) + 1e-2)

stft1 = Spectrogram.CQT2010(sr=fs, fmin=f0, filter_scale=filter_scale,
                 n_bins=n_bins, bins_per_octave=bins_per_octave, window=window, pad_mode='constant')
X1 = stft1(torch.tensor(x, device=device).unsqueeze(0), normalization_type=normalization_type)
X1 = torch.log(X1 + 1e-2)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), dpi=200)
axes[0].imshow(X1[0,:,:], aspect='auto', origin='lower')
axes[0].set_title('CQT2010')
axes[1].imshow(X2[0,:,:], aspect='auto', origin='lower')
axes[1].set_title('CQT2010v2')
# axes[1,0].imshow(X3[:,:], aspect='auto', origin='lower')

CQT2010v2 outputs 10 prints for every forward

Hi, so far I have had good results switching to your library for computing spectral reconstruction losses for raw waveform generation!

However, CQT2010v2 prints "downsample_factor = 4" ten times on every forward pass, which is not desirable when using it in minibatch training. Can it be disabled, please (maybe via an argument)?

In my case I do not manually set earlydownsample=False; I leave it at the default.
Maybe that affects the printing. Also, could you give a quick recommendation on this setting, please?

Thanks
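Until an argument covers this, a hedged workaround sketch: silence stdout around the forward pass with the standard-library contextlib.redirect_stdout.

import contextlib, io
import torch
from nnAudio import Spectrogram

cqt = Spectrogram.CQT2010v2(sr=44100)
x = torch.randn(1, 44100)

with contextlib.redirect_stdout(io.StringIO()):   # swallow the prints
    X = cqt(x)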

STFT Reconstruction from Mel Spectrograms

I've been playing around with trying to reconstruct an STFT spectrogram from a Mel spectrogram (derived using the MelSpectrogram class) and wondered if you might be interested in incorporating something of this sort into nnAudio.

I've created a Colab notebook to demonstrate my results. The reconstruction quality is currently slightly inferior to librosa's, but it is orders of magnitude faster. I tried my hand at some hyperparameter tuning, but judging by the values used by torchaudio and librosa, it seems a lot more iterations (and a much lower LR?) are needed to achieve optimal reconstruction quality, which I don't have the compute resources to search for. I've included some quick quality/speed comparisons in the Colab notebook.

My implementation is based on Librosa's mel_to_stft and TorchAudio's InverseMelScale.

If this is something you might be interested in adding to nnAudio, I'd be happy to open a pull request for further review.
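For context, a hedged sketch of the general idea (solving mel ≈ M · |STFT| for a non-negative |STFT| by gradient descent, roughly what torchaudio's InverseMelScale does; mel_spec and mel_basis are assumed to be given):

import torch

def mel_to_stft(mel_spec, mel_basis, n_iter=1000, lr=0.1):
    # mel_spec: (n_mels, T), mel_basis: (n_mels, n_freq) -> returns (n_freq, T)
    stft = torch.rand(mel_basis.shape[1], mel_spec.shape[1], requires_grad=True)
    optimizer = torch.optim.Adam([stft], lr=lr)
    for _ in range(n_iter):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(mel_basis @ stft.clamp(min=0), mel_spec)
        loss.backward()
        optimizer.step()
    return stft.detach().clamp(min=0)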

librosa's License

Hi, thanks for the code!

If I'm not mistaken, part of the code comes from librosa's source code, and the corresponding license is not included.
It would make sense to add it, IMO.

Documentation for STFT is wrong

    inverse : bool
        To activate the iSTFT module or not. By default, it is False to save GPU memory.

should be

    iSTFT : bool
        To activate the iSTFT module or not. By default, it is False to save GPU memory.

You might also consider adding:

     The iSTFT kernel is not trainable. If you want a trainable iSTFT, use the iSTFT module.

Incorrect example in nextpow2

A minor issue, but it may confuse readers that the example given in the docstring of nextpow2 is incorrect: nextpow2(6) returns 3, not 8.

def nextpow2(A):
    """A helper function to calculate the next nearest number to the power of 2.

    Parameters
    ----------
    A : float
        A float number that is going to be rounded up to the nearest power of 2

    Returns
    -------
    int
        The nearest power of 2 to the input number ``A``

    Examples
    --------
    >>> nextpow2(6)
    8
    """
    return int(np.ceil(np.log2(A)))
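A corrected version matching the docstring (returning the power of two itself rather than its exponent; the alternative fix is to change the docstring to say the function returns the exponent) might look like:

import numpy as np

def nextpow2(A):
    """Round A up to the nearest power of 2, e.g. nextpow2(6) == 8."""
    return int(2 ** np.ceil(np.log2(A)))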

torch.rfft has been moved

torch.rfft should be ported to torch.fft.rfft

Vc = torch.rfft(v, 1, onesided=False)

ceps = torch.rfft(spec, 1, onesided=False)[:,:,:,0]/np.sqrt(self.N)

spec = torch.rfft(ceps, 1, onesided=False)[:,:,:,0]/np.sqrt(self.N)

ceps = torch.rfft(spec, 1, onesided=False)[:,:,:,0]/np.sqrt(self.N)
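For reference, a hedged sketch of the equivalent calls on the new torch.fft API (PyTorch >= 1.8): torch.rfft(v, 1, onesided=False) computed a full complex FFT over the last dimension with real and imaginary parts stacked in a trailing dimension, and the [:,:,:,0] indexing above selects the real part.

import torch

v = torch.randn(2, 3, 8)

# Old: Vc = torch.rfft(v, 1, onesided=False)        -> shape (..., 8, 2)
Vc = torch.view_as_real(torch.fft.fft(v, dim=-1))   # same layout

# Old: torch.rfft(spec, 1, onesided=False)[..., 0]  -> the real part only
real_part = torch.fft.fft(v, dim=-1).real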

Is there a filter scale factor parameter in nnAudio?

There is a parameter named filter_scale in librosa.cqt:

filter_scale : float > 0
    Filter scale factor. Small values (<1) use shorter windows
    for improved time resolution.

How can I use this parameter in nnAudio? (See the sketch after the signature below.)
Thank you!
librosa.cqt(
    y,
    sr=22050,
    hop_length=512,
    fmin=None,
    n_bins=84,
    bins_per_octave=12,
    tuning=0.0,
    filter_scale=1,
    norm=1,
    sparsity=0.01,
    window='hann',
    scale=True,
    pad_mode='reflect',
    res_type=None,
    dtype=None,
)
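nnAudio's CQT classes accept the same parameter, as the CQT2010 reproduction code earlier on this page already uses. For example:

from nnAudio import Spectrogram

# filter_scale < 1 uses shorter windows for better time resolution,
# mirroring librosa.cqt's parameter of the same name.
cqt = Spectrogram.CQT2010v2(sr=22050, fmin=32.7, n_bins=84,
                            bins_per_octave=12, filter_scale=0.5)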

Some issues in applying nnAudio

Hi, I ran into some issues when applying nnAudio in my code:

1. The pip version (pip install nnAudio) differs from the GitHub version (downloaded from GitHub). For example, in the pip version the output_format of STFT is a parameter of __init__, while in the GitHub version it is a parameter of forward. I don't know if there are other differences.
2. How do I compute the iSTFT? The GitHub version has an inverse method on STFT, but when I use it I get this error:

File "/home3/lmh/anaconda3/envs/tasnet/lib/python3.7/site-packages/nnAudio-0.1.1-py3.7.egg/nnAudio/Spectrogram.py", line 572, in inverse
    elif len(X.shape) == 4 and self.output_format == "Complex":
File "/home3/lmh/anaconda3/envs/tasnet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 576, in __getattr__
    type(self).__name__, name))
AttributeError: 'STFT' object has no attribute 'output_format'

By the way, I set up two different stft_layers, one for the STFT and one for the iSTFT; if I use a single stft_layer for both, it works. Must I use one stft_layer for both the STFT and the iSTFT? When I use the pip version, there is no inverse method.
Can you check this please? Thanks a lot.
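A hedged sketch of the single-layer round trip on the GitHub version (assuming the iSTFT constructor flag documented elsewhere on this page, and output_format passed to forward as in that version):

import torch
from nnAudio import Spectrogram

stft = Spectrogram.STFT(n_fft=1024, hop_length=256, iSTFT=True)
x = torch.randn(1, 16000)

X = stft(x, output_format="Complex")  # (batch, freq_bins, time_steps, 2)
x_hat = stft.inverse(X)               # the same layer performs the iSTFT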

Learnable Window

Could you please elaborate on why you did not use a learnable window in STFT, MelSpectrogram, and MFCC, but did use one in their inverse counterparts?

Installation instructions don't work

I am trying to install master so I can try the Gammatonegram. (Will you make a new pre-release soon?)

I follow the installation instructions here: https://kinwaicheuk.github.io/nnAudio/intro.html#installation

However, it doesn't work. You can see a minimal colab here:

https://colab.research.google.com/drive/15ItkuaZV0XCR7nzsC0mr_3jj374s-Knz?usp=sharing

!git clone https://github.com/KinWaiCheuk/nnAudio.git
!cd nnAudio/Installation/ && python3 setup.py install
from nnAudio import Spectrogram
>>> ImportError: cannot import name 'Spectrogram'
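As a workaround, the pip subdirectory syntax from the README above installs master directly:

!pip install git+https://github.com/KinWaiCheuk/nnAudio.git#subdirectory=Installation
from nnAudio import Spectrogram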

CQT

Hi, does CQT automatically apply band-pass filtering based on the given fmin, fmax, and sr?

Mel_Basis kernel

Hello! While training my model, I used the MelSpectrogram function and realized that after a few epochs (10-15) the mel basis (mel filterbank) gets stuck and does not change at all. I even tried increasing the learning rate and the loss regularization. Could you let me know what is wrong?

pypi install?

Would it be possible to publish your library on PyPI? Then we could pip3 install the appropriate version and pin a specific version number.

Inverse STFT

Hello,

Can you please let me know if we can use multiple audio files in a single batch? Also, is there an option to compute the inverse STFT?

Explain difference to torch.stft

You mentioned in the readme that

Other GPU audio processing tools are torchaudio and tf.signal. But they are not using the neural network approach, and hence the Fourier basis can not be trained.

Can you explain this in more detail, please?

  • When would I benefit from the STFT in nnAudio compared to, say, torch.stft?

  • Does it make a difference which STFT I use when I am interested in a time-domain loss, i.e. does it change backprop?

Thanks!
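On the second point, a hedged sketch: with trainable=False both transforms are fixed linear maps, so a time-domain loss backpropagates through either; the difference only appears when the basis itself is a parameter (output_format as a constructor argument assumes a recent version):

import torch
from nnAudio import Spectrogram

x = torch.randn(1, 4096, requires_grad=True)

# nnAudio: gradients reach the waveform, and (optionally) the kernels too
spec = Spectrogram.STFT(n_fft=512, trainable=True, output_format="Magnitude")
spec(x).sum().backward()
print(x.grad is not None)                                  # True
print(any(p.grad is not None for p in spec.parameters()))  # True: learnable basis

# torch.stft: gradients reach the waveform only; there are no parameters
y = x.detach().clone().requires_grad_()
torch.stft(y, n_fft=512, return_complex=True).abs().sum().backward()
print(y.grad is not None)                                  # True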

Spectrogram.iSTFT backward very slow

Setting: Spectrogram.iSTFT(n_fft=1024, win_length=1024, freq_bins=None, hop_length=300,
fmax=7600, fmin=80, sr=24000, trainable_window=False, trainable_kernels=False,
verbose=False)
I used Spectrogram.iSTFT to convert spectrograms back to waveforms. For most batches the backward pass is fast, but sometimes it is very slow (0.116 s vs 184 s). When I used torch.istft instead, backward only took 0.07 s.

inverse transform from logscale to linear scale stft

Hi !

Your repo is a pretty awesome find; I am especially interested in using the STFT in log frequency.
Mel operations I was already doing myself using torch.stft and librosa filterbanks, but the more, the better to experiment with.

May I ask, is there any way to transform an STFT computed on a log frequency scale back to a linear frequency scale?

The use case I have in mind is converting waveforms into log-frequency spectrograms, filtering them, and then converting back to linear frequency so that the inverse STFT can return to the time domain.

Thanks !

Data Parallelism support

I am trying to use this library with multiple GPUs, but I am getting the following error message:

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/furby/.local/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/furby/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/furby/Documents/models/mobilenet_v1.py", line 129, in forward
    audioOut = self.forward_audio(audio)
  File "/home/furby/Documents/models/mobilenet_v1.py", line 124, in forward_audio
    return self.audioNet(x)
  File "/home/furby/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/furby/Documents/models/mobilenet_v1.py", line 96, in forward
    x = self.spec_layer(x)
  File "/home/furby/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/furby/.pyenv/versions/3.7.6/lib/python3.7/site-packages/nnAudio/Spectrogram.py", line 681, in forward
    spec = torch.sqrt(conv1d(x, self.wsin, stride=self.stride).pow(2) \
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

I have looked into the point at which the code stops, which seems to be when the batch is split across multiple GPUs and passed through the model. I believe this is because, when initialising the model, I configure the MelSpectrogram with device=device. For a single GPU or a CPU this is fine, but moving to multiple GPUs, the layer is pinned to just one of them. I am not sure whether the issue lies with my configuration or with the library itself, but I am after a way of having the device set on the fly.

My model implementation is as follows:

class Model(torch.nn.Module):
    def __init__(self, device="cpu"):
        super().__init__()
        config = dict(
            sr=16000,
            n_fft=400,
            n_mels=64,
            hop_length=160,
            window="hann",
            center=False,
            pad_mode="reflect",
            htk=True,
            fmin=125,
            fmax=7500,
            device=device,
        )  # note: this closing parenthesis is missing in the original snippet
        self.spec_layer = Spectrogram.MelSpectrogram(**config)

    def forward(self, x):
        x = self.spec_layer(x)
        x = x.view(x.size(0), 1, x.size(1), x.size(2))
        # ... the rest of the network (elided in this excerpt)
        return x
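A hedged workaround sketch: construct the layer without pinning a device and let the usual .to(device) / DataParallel replication place the kernels on each GPU (this assumes a nnAudio version whose kernels are registered as module buffers or parameters):

import torch
from nnAudio import Spectrogram

class AudioModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # No device= argument: the kernels travel with the module.
        self.spec_layer = Spectrogram.MelSpectrogram(sr=16000, n_fft=400,
                                                     n_mels=64, hop_length=160)

    def forward(self, x):
        return self.spec_layer(x)

model = torch.nn.DataParallel(AudioModel()).to("cuda")
spec = model(torch.randn(8, 16000))   # the batch is split across the GPUs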

A detailed comparison table might be helpful

First of all, thanks for your nice work! It works perfectly in my program.

But as there are already so many spectral processing libs, a more detailed comparison table might make your work stand out more.

I made a quick and dirty version:

| Feature | nnAudio | torch.stft | kapre | torchaudio | tf.signal (or other tf. stuff) | torch-stft | librosa |
|---|---|---|---|---|---|---|---|
| Trainable | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| ModelConvert* | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| Speed (needs testing**) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Differentiable (not sure**) | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Mel | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| MFCC | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| CQT | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| GPU support | 1 | 1 | 1 | 1 | 1 | 1 | 0 |

  • *Model Convert: as many mobile neural network runtimes only support a limited set of ops, the ability to convert to another framework (e.g. ONNX) matters for deployment on mobile devices. See also the reference here.
  • **More checking needed.

[Feature request] Log2 (octave) normalization in STFT

Currently, when we use the 'log' frequency scale for the STFT, the frequencies are spaced logarithmically with base e. I think it would be nice to also have logarithmic spacing with base 2, which is more musically meaningful.
It could be just a new freq_scale option, like 'log2'. If you want, I can do a PR :)

link to paper/citation

Hi - if I'm not mistaken, it's not obvious how to get from this repository to the paper or how to cite it. I recommend adding that information to the readme.

STFT.inverse() fails on magnitude spectrograms when called multiple times

Using v0.1.4a0

I came across this when calling STFT.inverse() more than once without calling STFT.forward(..., output_format="Magnitude") in between.

Minimal code sample

from nnAudio import Spectrogram
from scipy.io import wavfile
import torch

# Get an STFT magnitude spectrogram
sr, audio_np = wavfile.read("song.wav")   # wavfile.read returns (sample_rate, data)
audio = torch.tensor(audio_np).float()
to_stft = Spectrogram.STFT()
stft = to_stft(audio, output_format="Magnitude")

# Reconstruct audio from STFT
reconstructed = to_stft.inverse(stft)

# Reconstruct audio from STFT (again)
reconstructed = to_stft.inverse(stft)   # AssertionError: Only perform inverse function on Magnitude or Complex spectrogram.

It looks like the issue is caused by an internal call to self.forward(..., output_format="Complex") within STFT.inverse().

Workaround

Explicitly set the output format before calling inverse()

to_stft.output_format = "Magnitude"
reconstructed = to_stft.inverse(stft)

Fix?

Perhaps making STFT() similar to CQT2010v2() and specifying the output format at instantiation:

class STFT(nn.Module):
    def __init__(self, ..., output_format="Complex"):
        self.output_format = output_format
        ...

Unless there's a particular reason for setting self.output_format in STFT.forward(), it's probably more intuitive to set a default self.output_format in STFT.__init__().
For the sake of backward compatibility, the output_format param of STFT.forward() could be retained to override the default value without changing the module's state.

No module named librosa_filters

I followed the README, and when I tried from nnAudio import Spectrogram, it threw the error No module named librosa_filters. My librosa version is 0.7.0, just as mentioned in the README.
I tried changing the source code from 'librosa_filters' to 'librosa.filters'; it was then able to import, but when computing the STFT it threw a 'pad_center' is not defined error, so I reckon that would be a function in librosa_filters? Where can I get this librosa_filters module?

Improve audio quality of Griffin-Lim implementation

What should be added / fixed / improved?

  • The current Griffin-Lim implementation produces audio with a lot of noise and bursts compared to librosa's implementation, even when using the same number of optimization iterations.

  • "optim" module is not imported for Griffin-Lim (see here)

  • The test cases should enable GPU execution.

tf.signal is differentiable

Hi there,

I'm the author of Tensorflow's tf.signal package. Your paper says that tf.signal does not support gradients, however this is not true. All operations in tf.signal are fully differentiable and come with GPU and TPU support. Could you please update your paper on arXiv to correct this?
