
sincnet's People

Contributors

hbredin, mravanelli, rickychanhoyin, seungwonpark, vroger11


sincnet's Issues

speaker verification with GE2E loss

I am using your SincNet architecture with a GE2E loss, but the loss value does not drop. I also tried replacing the activation unit in the DNN with a logistic (sigmoid) function, but that did not help.
Can you give me some advice on why this happens, or is it simply that this loss function does not work well with SincNet?

AttributeError: module 'torch' has no attribute 'flip'

Hi, I am using torch 0.4.0 as mentioned in the README file, and I get the following error. Is this because of a version problem, or do I need to install additional dependencies (apart from the ones mentioned in the README)?

Traceback (most recent call last):
  File "speaker_id.py", line 228, in <module>
    pout=DNN2_net(DNN1_net(CNN_net(inp)))
  File "/home/paperspace/anaconda3/envs/sincnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/paperspace/SincNet/dnn_models.py", line 448, in forward
    x = self.drop[i](self.act[i](self.ln[i](F.max_pool1d(torch.abs(self.conv[i](x)), self.cnn_max_pool_len[i]))))
  File "/home/paperspace/anaconda3/envs/sincnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/paperspace/SincNet/dnn_models.py", line 144, in forward
    band_pass_right = torch.flip(band_pass_left, dims=[1])
AttributeError: module 'torch' has no attribute 'flip'
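
A possible workaround (a sketch, not an official fix): torch.flip was only introduced in PyTorch 0.4.1, so on 0.4.0 you can either upgrade torch or reverse the tensor with an index tensor instead.

import torch

def flip(x, dim):
    # Reverse x along `dim` using advanced indexing; equivalent to
    # torch.flip(x, dims=[dim]) on PyTorch >= 0.4.1.
    idx = torch.arange(x.size(dim) - 1, -1, -1).long()
    return x.index_select(dim, idx)

x = torch.randn(2, 5)
print(flip(x, 1))   # same values as torch.flip(x, dims=[1]) would give

Inside dnn_models.py the failing line would then become band_pass_right = flip(band_pass_left, 1).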

Perform speaker identification

Hi, how would I perform inference for speaker identification using your implementation? For example, how would I get the predicted speaker for the TIMIT wavefile dr3/fjlr0/sa1.wav from your SincNet implementation?
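
A minimal sketch of one way to do this, assuming CNN_net, DNN1_net and DNN2_net have been built and loaded from the checkpoint exactly as in speaker_id.py, and that wlen and wshift hold the chunk length and shift in samples (e.g. 3200 and 160 for 200 ms windows with a 10 ms shift at 16 kHz); chunk-level outputs (log-probabilities with the default softmax cfg) are accumulated and the arg-max gives the predicted speaker class:

import numpy as np
import soundfile as sf
import torch

signal, fs = sf.read('TIMIT/test/dr3/fjlr0/sa1.wav')   # illustrative path
signal = signal / np.max(np.abs(signal))               # same amplitude normalization as in training

CNN_net.eval(); DNN1_net.eval(); DNN2_net.eval()
scores = 0
with torch.no_grad():
    for beg in range(0, len(signal) - wlen, wshift):
        chunk = torch.from_numpy(signal[beg:beg + wlen]).float().view(1, -1)
        # move chunk to the GPU here if the networks are on the GPU
        scores = scores + DNN2_net(DNN1_net(CNN_net(chunk)))   # per-chunk log-probabilities

pred_class = int(torch.argmax(scores))
print('predicted speaker class index:', pred_class)

The predicted index refers to the integer labels stored in TIMIT_labels.npy, so mapping it back to a speaker ID requires inverting that dictionary.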

Question about TIMIT_labels.npy and others

Hello,
First of all, thanks for sharing your work :)
I am new in speech recognition science field.

Wanted to ask couple of questions:

  1. TIMIT_labels.npy: is there a way to track how the IDs (particular numbers) are assigned in the dictionary? I loaded the file from the Python prompt (see the snippet after this list) and see, for example, 'train/dr5/fjxm0/sx311.wav': 267. How is 267 assigned (maybe there is a reference somewhere in the TIMIT dataset)? As I understand it, the speaker ID in this case is fjxm0. So, is the value 267 important, or could all occurrences of 267 in the file be changed to some new value?

  2. Can you give some guidelines on how to choose the training and test sets for the model, e.g. what percentage for training and for testing?

  3. The file model_raw.pkl: is this the file for the trained model? How can I use it?

  4. As I understand it, SincNet solves the speaker identification task. How does that differ from a speech recognition task? And how can I adapt SincNet to an individual speaker (e.g. to compare results for the same speaker speaking twice)? Maybe for an individual speaker I need to design the test set so that only audio files from that speaker are included?
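
A small sketch for inspecting the label file (the integers are just class indices for the softmax, so only their consistency matters); the path is illustrative, adjust it to your setup:

import numpy as np

# TIMIT_labels.npy stores a plain Python dict {wav_path: integer_class_index}.
lab_dict = np.load('data_lists/TIMIT_labels.npy', allow_pickle=True).item()
print(lab_dict['train/dr5/fjxm0/sx311.wav'])   # -> 267, an arbitrary but consistent class index
print(len(set(lab_dict.values())))             # number of distinct speakers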

Looking forward to hearing from you :)

Best regards,
Andrius L.

Training on a new dataset

While running speaker_id.py on my dataset, I am getting the following error:

Traceback (most recent call last):
  File "speaker_id.py", line 227, in <module>
    [inp,lab]=create_batches_rnd(batch_size,data_folder,wav_lst_tr,snt_tr,wlen,lab_dict,0.2)
  File "speaker_id.py", line 52, in create_batches_rnd
    sig_batch[i,:]=signal[snt_beg:snt_end]*rand_amp_arr[i]
ValueError: could not broadcast input array from shape (3200,2) into shape (3200)
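
The shape (3200, 2) suggests the wav files are stereo, while create_batches_rnd expects a one-dimensional (mono) signal. One possible fix (a sketch) is to downmix to mono right after reading the file:

import numpy as np
import soundfile as sf

signal, fs = sf.read('my_dataset/some_file.wav')   # illustrative path
if signal.ndim > 1:
    signal = signal.mean(axis=1)                   # stereo -> mono downmix
# signal now has shape (n_samples,), which is what create_batches_rnd slices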

How do I run the trained model file

@mravanelli
I have completed training with my own dataset. Now I want to use the trained model to make predictions on wav files. How do I get the predictions? Can you please help?

Thanks!
Sivam.
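
A sketch of the first step (hedged, assuming the checkpoint layout written by speaker_id.py): rebuild the three networks with the same cfg used for training, then restore their weights from model_raw.pkl before running the chunk-and-vote prediction described in the "Perform speaker identification" issue above.

import torch

checkpoint = torch.load('exp/SincNet_TIMIT/model_raw.pkl', map_location='cpu')  # illustrative path
CNN_net.load_state_dict(checkpoint['CNN_model_par'])
DNN1_net.load_state_dict(checkpoint['DNN1_model_par'])
DNN2_net.load_state_dict(checkpoint['DNN2_model_par'])
CNN_net.eval(); DNN1_net.eval(); DNN2_net.eval()   # disable dropout for inference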

SincNet Weights

Hi sir, can you please share the pre-trained model or its weights? :)

Reproducing LibriSpeech results

Hi,

I've been able to reproduce to a very close degree the results of the TIMIT experiment. However, I believe to reproduce the LibriSpeech results, I'll need a bit more information if you don't mind. I've currently downloaded the clean 100 and 360 hour datasets as well as the 500 hour "other" dataset. This has about 100 fewer speakers than the number you reference in your paper. Could you provide the names of the Libri datasets from which you drew your speakers?

How did you preprocess them? I know in the paper you mention using only 12-15 s of material per speaker for training. Did you just take the first 15 s of each utterance (or less if the utterance was shorter)? If not, could you explain how you arrived at the training utterances, as well as the preprocessing you applied?

Could you also provide the training/testing data lists similar to TIMIT_train.scp and TIMIT_test.scp?

Again, this is aimed at trying to reproduce the Libri results, so as much specificity as possible would be great!

Thanks for all you've done!

How about the generalization of SincNet?

First, thanks for your contribution.
In your experiments, the Classification Error Rate (CER%) is used for the speaker-id task and the Equal Error Rate (EER%) for speaker verification. However, deep feature representations and similarity scores are now widely used for speaker recognition, so it would be worth examining the generalization of SincNet when training on dataset A and testing on dataset B.

TIMIT_preparation.py

Traceback (most recent call last):
  File "/home/administrator/Videos/SincNet-master/TIMIT_preparation.py", line 45, in <module>
    in_folder = sys.argv[1]
IndexError: list index out of range

Please give a solution to the above error.
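
The IndexError means the script was started without its command-line arguments: TIMIT_preparation.py expects the original TIMIT folder, an output folder, and the list file, as in the README. For example (paths are illustrative):

python TIMIT_preparation.py /data/TIMIT /data/TIMIT_norm data_lists/TIMIT_all.scp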

How to do GPU parallelism with this code?

Hello,

I can run this code successfully on a single GPU, but it fails when I try to use DataParallel to call several GPUs. How can I use multiple GPUs to speed up training?

Main code is listed:

CNN_net=CNN(CNN_arch)
CNN_net.cuda()
CNN_net=nn.DataParallel(CNN_net)

DNN1_net=MLP(DNN1_arch)
DNN1_net.cuda()
DNN1_net=nn.DataParallel(DNN1_net)

DNN2_net=MLP(DNN2_arch)
DNN2_net.cuda()
DNN2_net=nn.DataParallel(DNN2_net)

The error info like this:

Traceback (most recent call last):
  File "speaker_id.py", line 180, in <module>
    DNN1_arch = {'input_dim': CNN_net.out_dim,
  File "/Work19/2017/xxx/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 535, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'out_dim'
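
nn.DataParallel wraps the network, so custom attributes such as out_dim live on the wrapped module rather than on the wrapper. A sketch (assuming the CNN class and CNN_arch dict from speaker_id.py):

CNN_net = CNN(CNN_arch)
cnn_out_dim = CNN_net.out_dim                 # read custom attributes before wrapping...
CNN_net = nn.DataParallel(CNN_net.cuda())
assert CNN_net.module.out_dim == cnn_out_dim  # ...or reach them afterwards through .module

DNN1_arch would then be built with 'input_dim': cnn_out_dim (or CNN_net.module.out_dim) instead of CNN_net.out_dim, and likewise for any other attribute accessed on the wrapped networks.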

Arxiv Paper Link

Dear Mirco,

I couldn't find the link to arxiv paper. When will it be available?

Thanks.

Adding new speakers/using transfer learning

Hello,

First of all, thanks for all the great work :)

I managed to reproduce the results from the paper using the TIMIT dataset and I am now thinking about the following scenario:

I have a dataset of 500 speakers, I trained the model on it, I get a good enough accuracy and the model can reliably identify one of those 500 speakers from an audio sample. Now I need to add one or more new speakers, let's say 5; the desired outcome is a model that can identify one of the now 505 speakers. This could be a case that repeats in the future, as I get more audio data.

I currently have these approaches in mind:

  1. Train the model from scratch every time I need to add new speakers. The disadvantage to this is that I don't leverage any accumulated knowledge from previous trainings.

  2. Use transfer learning somehow - load the weights from the "500 speakers" trained model and replace the softmax layer with one that has 505 classes, then train a few more epochs.

  3. Same as 2, except we also freeze all the layers except softmax.

How would you approach this? If 2 and 3 are viable options, how would you implement that? Would changing "class_lay" in the config to 505 and training with the new dataset be enough for 2? How would you approach freezing the non-softmax layers?
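
A sketch of options 2 and 3 (hedged: it assumes the checkpoint layout of speaker_id.py and a default-style cfg in which DNN2_net is just the softmax classification layer, with class_lay changed to 505 so DNN2_net is rebuilt with 505 outputs; the optimizer values are illustrative):

import torch

checkpoint = torch.load('exp/my_500spk_model/model_raw.pkl', map_location='cpu')  # illustrative path
CNN_net.load_state_dict(checkpoint['CNN_model_par'])     # reuse the feature extractor
DNN1_net.load_state_dict(checkpoint['DNN1_model_par'])   # reuse the first MLP
# DNN2_net (the new 505-way classifier) keeps its fresh random initialization.

# Option 3: additionally freeze everything except the classifier.
for p in list(CNN_net.parameters()) + list(DNN1_net.parameters()):
    p.requires_grad = False
optimizer = torch.optim.RMSprop(DNN2_net.parameters(), lr=0.001, alpha=0.95, eps=1e-8)

For option 2 you would skip the freezing loop, keep optimizers over all three networks, and continue training for a few epochs on the 505-speaker data.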

Thanks again,
Bogdan

Training on GPU got error on CPU decoding

I got this error while doing d-vector extraction. Is the d-vector the output of the 2nd CNN layer? The error seems to be related to having the GPU version of Torch available but not the CPU one. How did you handle this in your case? Thanks! Below is the error message:

import torch
  File "anaconda3/lib/python3.6/site-packages/torch/__init__.py", line 80, in <module>
    from torch._C import *
ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)

Performance on larger datasets

Hi Mirco,

I'm curious if you had to make any adjustments to the structure of the model to handle the >2k speakers in LibriSpeech? I've been attempting to fit a model on >3k speakers, and the Sentence Classification Error doesn't drop below ~50%. I have ~3min of speech per speaker.

Could you provide the res.res file for LibriSpeech, like the one you provided for TIMIT?

Installation Update

As promised -- an update

Moved from mravanelli/pytorch-kaldi#88 to here at SincNet.
On pytorch-kaldi #88 I stated:
I am going to attempt to run the SincNet speaker-id experiment.
If it fails to run I will look into more hardware. So ...

Attempt to install on Raspberry Pi 3B+

Requirements:
Linux
Python 3.6/2.7
pytorch 1.0
pysoundfile
anaconda

Bottom line toward Requirements:
(I am skipping the installation output except for the last line)

Linux: Raspbian GNU/Linux 9 (stretch) *** YES ***

Python 3.6/2.7 *** NO *** I have Python 3.5.3 -- Anaconda installed 3.4

pi@raspberrypi:~ $ conda install anaconda-client
Anaconda *** Maybe *** The following NEW packages will be INSTALLED:
anaconda-client: 1.0.2-py34_0
clyent: 0.4.0-py34_0
freetype: 2.5.2-2
jpeg: 8d-0
libpng: 1.6.17-0
libtiff: 4.0.2-1
pillow: 2.9.0-py34_0
pip: 7.1.2-py34_0
python-dateutil: 2.4.2-py34_0
pytz: 2015.4-py34_0
setuptools: 18.1-py34_0
six: 1.9.0-py34_0
wheel: 0.24.0-py34_0

pi@raspberrypi:~ $ sudo apt-get install pytorch
pytorch E: Unable to locate package pytorch
pi@raspberrypi:~ $ pip3 install pytorch
Exception: You tried to install "pytorch". The package named for PyTorch is "torch"
pi@raspberrypi:~ $ sudo apt-get install torch
RuntimeError: PyTorch does not currently provide packages for PyPI (see status at pytorch/pytorch#566).
Please follow the instructions at http://pytorch.org/ to install with miniconda instead.
pi@raspberrypi:~ $ conda install pytorch=0.4.1 -c pytorch
pytorch *** NO *** Error: No packages found in current linux-armv7l channels matching: pytorch 0.4.1*

pysoundfile ( conda install -c conda-forge pysoundfile)
pi@raspberrypi:~ $ conda install -c conda-forge pysoundfile
Fetching package metadata: ......
Solving package specifications:
Error: Could not find some dependencies for pysoundfile: cffi
pi@raspberrypi:~ $ conda install --channel https://conda.anaconda.org/poppy-project cffi
The following packages conflict with each other:
cffi
python 3.4*
pysoundfile *** NO ***

Toward Requirements: 1.5 out of 5

************************* PLAN B *************************

Order a new system

https://developer.nvidia.com/embedded/buy/jetson-nano-devkit $99
NVIDIA® Jetson Nano™ Developer Kit is a small, powerful computer that lets you run multiple neural networks in parallel for applications like image classification, object detection, segmentation, and speech processing.
JetPack is compatible with NVIDIA’s world-leading AI platform for training and deploying AI software, and reduces complexity and effort for developers by supporting many popular AI frameworks, like TensorFlow, PyTorch, Caffe, and MXNet. It also includes a full desktop Linux environment and out-of-the-box support for a variety of popular peripherals, add-ons, and ready-to-use projects.

Technical Specifications
GPU 128-core Maxwell
CPU Quad-core ARM A57 @ 1.43 GHz
Memory 4 GB 64-bit LPDDR4 25.6 GB/s
Storage microSD (not included)
Video Encode 4K @ 30 | 4x 1080p @ 30 | 9x 720p @ 30 (H.264/H.265)
Video Decode 4K @ 60 | 2x 4K @ 30 | 8x 1080p @ 30 | 18x 720p @ 30 (H.264/H.265)
Camera 1x MIPI CSI-2 DPHY lanes
Connectivity Gigabit Ethernet, M.2 Key E
Display HDMI 2.0 and eDP 1.4
USB 4x USB 3.0, USB 2.0 Micro-B
Others GPIO, I2C, I2S, SPI, UART
Mechanical 100 mm x 80 mm x 29 mm

https://www.adafruit.com/product/1995 $8
5V 2.4A Switching Power Supply

https://www.sandisk.com/home/memory-cards
SanDisk microSDHC™ 64GB SD Card $13

Existing USB Keyboard, USB Mouse, HDMI screen

Toward Requirements:
Linux: Ubuntu 18.04 LTS
Python 3.6
PyTorch 1.1

Watched a tutorial on configuring, booting up, and running some examples.
And another on Introduction to Deep Learning

Unknown out of 5 (I will update you in a week / 10 days)

Is the d-vector extracted strictly according to the original d-vector paper?

Table 2 of the paper compares the performance on the speaker verification task between SincNet with d-vectors and several other models. How is the d-vector extracted? Which of the following is correct?

  1. The naive way. A random chunk (200 ms, not necessarily from the same utterance) of the speaker's audio is pre-processed (x -> x / abs(max(y)), where x is the current chunk and y is the whole signal) and fed to the network; the output of the last hidden layer (i.e. the output of DNN1_net) is taken. Doing the same for other audio chunks (so many chunks are consumed) yields many vectors, denoted d_1, d_2, ..., d_n; each d_i is L2-normalized to get d'_i, and the final d-vector of this speaker is mean(d'_1, d'_2, ..., d'_n).

  2. The "original" way. A speaker is represented by a sequence of utterance, {O_i: i}, each utterance is consisted of a sequence of frames, {o_j: j}, during enrollment/verification, each o_j is companied by its context, (the original paper uses 30 frames to the left and 10 frames to the right), and fed to the network, the output of the last hidden layer is extracted, denoted with a_j, a_j is then L2 normalized, the d-vector of utterance O_i is sum_{j} {a_j}; the d-vector of the speaker is mean({d-vector of O_i: i})

The d-vector paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41939.pdf

PS: in both cases, I assume that the network used for extracting the d-vector is the same one produced by speaker_id.py.
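
For reference, a sketch of interpretation 1 (this is only an illustration, not necessarily what the paper did), assuming CNN_net and DNN1_net have been loaded from the speaker-id checkpoint and that wlen/wshift hold the chunk length and shift in samples:

import numpy as np
import soundfile as sf
import torch

signal, fs = sf.read('speaker_audio.wav')              # illustrative path
signal = signal / np.max(np.abs(signal))               # amplitude normalization as in training

CNN_net.eval(); DNN1_net.eval()
d_vectors = []
with torch.no_grad():
    for beg in range(0, len(signal) - wlen, wshift):
        chunk = torch.from_numpy(signal[beg:beg + wlen]).float().view(1, -1)
        d = DNN1_net(CNN_net(chunk))                   # output of the last hidden layer
        d_vectors.append(d / d.norm(p=2))              # L2-normalize each chunk embedding

speaker_dvector = torch.cat(d_vectors).mean(dim=0)     # mean of the normalized chunk vectors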

Utilization rate of the GPU

My data is very large, but the utilization rate of the GPU (24 GB) is very low during training: a batch size of 128 uses only 1 GB out of 24 GB. I could increase the batch size to improve utilization, but I worry about the impact on convergence. Do you have any suggestions for improving GPU utilization? Thank you.

LayerNorm == torch.nn.InstanceNorm1d ?

I believe what you call LayerNorm is actually InstanceNorm1d in pytorch.

SincNet/dnn_models.py, lines 112 to 123 at commit 488c982:

class LayerNorm(nn.Module):

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

It is my understanding that Layer Normalization would actually have one weight/bias per sample in the sequence, while Instance Normalization only has one per channel. Do you confirm?
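
A small numerical check of the claim (a sketch; note that torch.std is unbiased while instance_norm uses the biased estimator, and eps is applied in a slightly different place, so the two outputs agree only up to a small residual):

import torch
import torch.nn.functional as F

x = torch.randn(4, 8, 100)               # (batch, channels, time)
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
custom = (x - mean) / (std + 1e-6)        # the quoted forward(), without gamma/beta
inst = F.instance_norm(x)                 # InstanceNorm1d behaviour, no affine parameters
print((custom - inst).abs().max())        # small residual from biased vs. unbiased std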


Question about the paper "Learning Speaker Representations with Mutual Information"

Hello, I've been reading your paper and I'm a little curious about the calculation of mutual information, i.e. when the MI is transferred to the KL-based formulation involving the joint distribution and the product of the two marginals.

How should I understand that, when we train the network, we sample (z1, z2) as the joint distribution and (z1, z_rand) as the other? What are the true joint distribution and the two marginal distributions? Thanks.
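
A general MINE-style sketch of the sampling trick (an illustration of the usual estimator, not necessarily the paper's exact code): pairs (z1, z2) drawn from the same speaker play the role of samples from the joint distribution, while shuffling z2 across the batch breaks the pairing and approximates samples from the product of the marginals.

import torch

def mi_lower_bound(T, z1, z2):
    # Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)]
    joint = T(z1, z2)                                  # (z1, z2) from the same speaker
    z2_shuffled = z2[torch.randperm(z2.size(0))]       # pair z1 with a random other z2
    marginal = T(z1, z2_shuffled)
    return joint.mean() - torch.log(torch.exp(marginal).mean())

disc = torch.nn.Bilinear(64, 64, 1)                    # toy discriminator T(z1, z2)
z1, z2 = torch.randn(32, 64), torch.randn(32, 64)
print(mi_lower_bound(lambda a, b: disc(a, b).squeeze(-1), z1, z2))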

Rationale for dividing speech into chunks of 200ms with 10ms overlap

Hi, I am trying to check the performance of SincNet on the VoxCeleb dataset. I am wondering about the rationale for extracting 200 ms chunks of the signal during training, and also the 10 ms overlap you use at test time. Does the model depend on this?

Can I use longer chunks, e.g. 3 s of audio as the VoxCeleb paper seems to use, given that VoxCeleb is a much larger dataset?

TIMIT (.wrd, .txt, .phn) file interpretations (numbers in front of the line)

Hello,

I want to find out more details about TIMIT database (in particular .TXT, .PHN and .WRD files):
For example (in folder train/dr1/FCJF0).

In file SI1657.TXT I have the following:
0 45466 Or borrow some money from someone and go home by bus?

Question: what do the numbers '0' and '45466' refer to? Perhaps a time duration in milliseconds?

File SI1657.WRD :
2120 3533 or
3533 8200 borrow
8200 12291 some
12291 15325 money
15325 18435 from
18435 25984 someone
25984 28960 and
28960 31000 go
31000 34599 home
34599 36200 by
36200 43480 bus

Question: what do the numbers in the first two columns refer to?

File SI1657.PHN (took a fragment) :
0 2120 h#
2120 2725 q

Question: what do the numbers (0, 2120 and 2120, 2725) refer to?

Another question: would SincNet work if no .PHN (phonetic) files were provided with the dataset?

Best regards,
Andrius L.

error in create_batches

Hi,
First of all, thanks for sharing your work :)

I use Google Colab to execute this project.

When I run: !python3 "/content/keras-sincnet/speaker_id.py" --cfg=/content/keras-sincnet/cfg/SincNet_TIMIT.cfg

I get the following error:

File "/content/keras-sincnet/speaker_id.py", line 227, in
[inp,lab]=create_batches_rnd(batch_size,data_folder,wav_lst_tr,snt_tr,wlen,lab_dict,0.2)
File "/content/keras-sincnet/speaker_id.py", line 53, in create_batches_rnd
lab_batch[i]=lab_dict[wav_lst[snt_id_arr[i]]]
KeyError: 'TRAIN/DR1/FSJK1/SX305.WAV'

Can you give some guidelines on how to resolve this error?
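
One possible cause (a guess from the traceback): the label dictionary keys and the paths in the list file differ in case (e.g. 'TRAIN/DR1/FSJK1/SX305.WAV' vs. 'train/dr1/fsjk1/sx305.wav'). A sketch of a workaround is to normalize both sides to lower case before the lookup (paths illustrative):

import numpy as np

lab_dict = np.load('data_lists/TIMIT_labels.npy', allow_pickle=True).item()
lab_dict = {k.lower(): v for k, v in lab_dict.items()}     # normalize the dictionary keys
# ...and in create_batches_rnd, look the key up in lower case as well:
# lab_batch[i] = lab_dict[wav_lst[snt_id_arr[i]].lower()]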

TIMIT_preparation.py

File "/home/administrator/Videos/SincNet-master/TIMIT_preparation.py", line 51, in
copy_folder(in_folder,out_folder)
File "/home/administrator/Videos/SincNet-master/TIMIT_preparation.py", line 36, in copy_folder
shutil.copytree(in_folder, out_folder, ignore=ig_f)
File "/home/administrator/Music/dell/lib/python3.5/shutil.py", line 303, in copytree
names = os.listdir(src)
NotADirectoryError: [Errno 20] Not a directory: '/home/administrator/Videos/SincNet-master/TIMIT_preparation.py'

I need a solution for the above error.

How can I use SincNet for the speaker verification task?

Thank you for your contribution.
In the paper you mention the speaker verification performance, but in the code I did not find anything related to speaker verification. Could you please explain how I can implement verification?
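
The repository only ships the speaker-id training script, so verification has to be built on top of it. A common (and here purely illustrative) recipe is cosine scoring between speaker embeddings (d-vectors) of the enrollment and test utterances, with a decision threshold tuned on a development set (e.g. at the equal error rate point):

import torch
import torch.nn.functional as F

# d_enroll and d_test would be speaker embeddings, e.g. averaged DNN1_net outputs
# as discussed in the d-vector issue above; random vectors are placeholders here.
d_enroll = torch.randn(2048)
d_test = torch.randn(2048)

score = F.cosine_similarity(d_enroll.unsqueeze(0), d_test.unsqueeze(0)).item()
threshold = 0.5          # illustrative; in practice chosen on a dev set
print('same speaker' if score > threshold else 'different speaker')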

trained model error

Hi, the trained model does not seem to work when running compute_d_vector:
Missing key(s) in state_dict: "conv.0.low_hz_", "conv.0.band_hz_"
The keys in checkpoint_load['CNN_model_par'] are:

conv.0.filt_b1
conv.0.filt_band
conv.1.weight
conv.1.bias
conv.2.weight
conv.2.bias
bn.0.weight
bn.0.bias
bn.0.running_mean
bn.0.running_var
bn.1.weight
bn.1.bias
bn.1.running_mean
bn.1.running_var
bn.2.weight
bn.2.bias
bn.2.running_mean
bn.2.running_var
ln.0.gamma
ln.0.beta
ln.1.gamma
ln.1.beta
ln.2.gamma
ln.2.beta
ln0.gamma
ln0.beta

Thanks a lot.

computed d-vectors aren't consistent across runs

Thanks for this great toolkit!

I've trained the speaker-id model and I'm now trying to extract the d-vectors for various wav files.

I notice that if I print the final utterance d-vector, I get different vector values each time:

For example, on the first run, I get:

[0.00486447 0.01101663 0.00926225 ... 0.0081592  0.00329391 0.01262286]

And the 2nd time I start the script, I get

[-6.3974774e-05 -1.0634900e-02 -1.0657464e-04 ...  1.9263949e-02
 -2.3915395e-03  4.1378587e-02]

And again:

[0.00949155 0.01689023 0.00099393 ... 0.01041446 0.01067957 0.01114707]

If I put the same audio file twice in the wav list, I get consistent values within a run, but they always differ across runs.

Any clues? I get this whether running on the GPU or the CPU.
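
Two hedged things worth checking (a sketch, since the exact extraction script is not shown here): whether the networks are in eval() mode (active dropout makes every forward pass stochastic), and whether the random seeds differ between runs if any random chunk selection is involved.

import random
import numpy as np
import torch

CNN_net.eval(); DNN1_net.eval()      # disable dropout for deterministic embeddings

seed = 1234                          # fix seeds if any random chunking is used
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)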

Improving SincNet results on TIMIT by adding reverberation

Not an issue, but I just wanted to post that you can further decrease the sentence classification error by reverberating each training utterance and including the reverberated copies in the training dataset (effectively doubling the training size). The error drops to 0%. I will try with Libri as well.

Visualizing sinc filter kernels

Hi Mirco,

thanks for sharing this great work! I'm trying to visualize the weights of the sinc layer in time and frequency domain but I'm having trouble getting it right. Some filters don't look like a single bandpass but have different amplitudes for different frequencies (e.g., see the right example in the eleventh row in the figure below).

The code below just loads a trained model from the saved checkpoint, computes the filters in time domain, and visualizes them alongside their Fourier transform. I'm certain it's just a problem with visualizing them. If you could have a look or share the example from Figure 2 in the paper, that would be highly appreciated.

Thanks,
Benedikt

%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import math
import torch

sampling_frequency = 16000
num_filters = 80
filter_length = 251
frequency_scaling = sampling_frequency * 1.0

# Load trained model from checkpoint
torch_checkpoint = torch.load("exp/SincNet_TIMIT/model_raw.pkl", map_location='cpu')
low_hz = torch_checkpoint["CNN_model_par"]["conv.0.filt_b1"]
band_hz = torch_checkpoint["CNN_model_par"]["conv.0.filt_band"]

# Compute filter kernel. Except from renaming some variables, this should be the same as the original code.
def flip(x, dim):
    xsize = x.size()
    dim = x.dim() + dim if dim < 0 else dim
    x = x.contiguous()
    x = x.view(-1, *xsize[dim:])
    x = x.view(x.size(0), x.size(1), -1)[:,
        getattr(torch.arange(x.size(1) - 1, -1, -1), ('cpu', 'cuda')[x.is_cuda])().long(), :]
    return x.view(xsize)

def sinc(band, t_right):
    y_right = torch.sin(2 * math.pi * band * t_right) / (2 * math.pi * band * t_right)
    y_left = flip(y_right, 0)

    y = torch.cat([y_left, torch.autograd.Variable(torch.ones(1)), y_right])

    return y

filters = torch.autograd.Variable(torch.zeros((num_filters, filter_length)))
N = filter_length
t_right = torch.autograd.Variable(torch.linspace(1, (N - 1) / 2, steps=int((N - 1) / 2)) / sampling_frequency)

min_freq = 50.0;
min_band = 50.0;

filt_beg_freq = torch.abs(low_hz) + min_freq / frequency_scaling
filt_end_freq = filt_beg_freq + (torch.abs(band_hz) + min_band / frequency_scaling)

n = torch.linspace(0, N, steps=N)

# Filter window (hamming)
window = 0.54 - 0.46 * torch.cos(2 * math.pi * n / N);
window = torch.autograd.Variable(window.float())

for i in range(num_filters):
    low_pass1 = 2 * filt_beg_freq[i].float() * sinc(filt_beg_freq[i].float() * frequency_scaling, t_right)
    low_pass2 = 2 * filt_end_freq[i].float() * sinc(filt_end_freq[i].float() * frequency_scaling, t_right)
    band_pass = (low_pass2 - low_pass1)

    band_pass = band_pass / torch.max(band_pass)

    filters[i, :] = band_pass * window

filters = filters.view(num_filters, 1, filter_length)
filters = filters.detach().numpy()

# Visualize filter kernels (similar to https://gist.github.com/endolith/236567)
# Two filters and their Fourier transform per row.
num_cols = 4
num_rows = int(np.ceil(num_filters * 2 / num_cols))

fig, axes = plt.subplots(num_rows, num_cols, figsize=(9, 80))
for i in range(num_filters):
    spatial_ax = axes[(i * 2) // 4, (i * 2) % 4]
    frequency_ax = axes[(i * 2 + 1) // 4, (i * 2 + 1) % 4]
    
    weights = filters[i, 0, :]
    
    # Frequency computation
    ampl = 1/N * np.abs(np.fft.rfft(weights))
    
    # RFFT frequency bins
    freqs = np.fft.rfftfreq(N, 1/sampling_frequency)
    
    spatial_ax.plot(weights)
    frequency_ax.stem(freqs, ampl)
    
fig.tight_layout()

[Figure: visualize_weights — grid of learned sinc filters in the time domain alongside their Fourier transforms]

Taking the whole speech sequence as input without chunking

Hi Mirco, many thanks for the great work and for sharing the code! It's super useful for the work I am doing. For the speaker identification task, each training sample has length 3200, since fs=16000, cw_len=200, and wlen=int(fs*cw_len/1000.00)=3200. For testing, voting over even smaller chunks is performed.

I have a speech classification task for which, I think, it is best to take the whole speech sequence as input without chunking. Is it currently possible to use variable-length input sequences with SincNet at all? If not, I would pad each batch with zeros to the maximum length in that batch. Would SincNet be affected by padded batches?

Oh, by the way, my speech sequences actually vary in length a lot. What would be your suggestion, Mirco? Many thanks again!

Configuration

I configured the following which I believe will allow me to load the software I need and run SincNet. The manufacturer said it would run but questioned why I bought so much memory and only one monitor. If you see holes that need plugging, I would appreciate any comments.
Thank you, Gerard

Workstation 7920
Intel Xeon Gold 6140 2.3GHz, 3.7GHz Turbo 18C, 10.4GT/s 3UPI, 25MB Cache, HT (140W) DDR4-2666
Ubuntu Linux 18.04
NVIDIA Quadro GV100, 32GB, 4 DP (Precision xx20 Towers)
384GB 12x32GB DDR4 2666MHz RDIMM ECC
M.2 256GB PCIe NVMe Class 40 Solid State Boot Drive
M.2 2TB PCIe NVMe Class 40 Solid State Drive
Keyboard, Mouse, Monitor
Ethernet LAN

Question about gap between err_tr and err_te

I really appreciate that you released the paper together with the source code. I also have a question about the performance gap between err_tr and err_te:

According to the results shown in res.res, after epoch 360 err_tr=0.009600 and err_te=0.419954. The gap between training and validation performance seems quite large. Does it mean the model suffers from some kind of overfitting?

voxCeleb1 and libri speech

Hi Mirco,
Thanks for the great work! I was wondering if you plan to share the data preparation recipes for VoxCeleb1 and LibriSpeech that would allow us to reproduce the other experiments from your paper.

Run TIMIT data preparation.

Hello, when I was doing this step, the following problem occurred:
$ python TIMIT_preparation.py $TIMIT_FOLDER $OUTPUT_FOLDER data_lists/TIMIT_all.scp
Traceback (most recent call last):
  File "TIMIT_preparation.py", line 41, in <module>
    out_folder=sys.argv[2]
IndexError: list index out of range

I hope you can give me some answers, thank you very much.

How to divide the test set?

Hello!
Firstly, thanks for sharing the code of your paper, it's really fantastic work!
But I'm quite confused about how to test my own model. When testing the model for speaker identification, we should divide the test set into two parts: some of the data is used for enrollment and the rest for testing.
For example, if there are ten sentences per speaker, it may not be appropriate to use nine of the sentences for enrollment and one for testing, as the model may learn a lot from the nine sentences and easily make a correct prediction on the remaining one. In that case the accuracy might be higher than it truly should be, which is not what I want.
I read your code carefully but didn't find the answer there, sorry about that. :(
So could you please tell me how to divide the test set?

Using my own database

Hi,
I wanted to run speaker_id with our own database. We will have train and test wav files. How do we go about doing this?

Thanks,
Sivam.
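
A sketch (file names and layout purely illustrative) of producing the pieces speaker_id.py expects: a training list, a test list, and a label dictionary saved as .npy, which the cfg then points to.

import os
import numpy as np

data_folder = 'my_dataset/'            # assumed layout: my_dataset/<speaker>/<file>.wav
train_lst, test_lst, lab_dict = [], [], {}

for spk_index, spk in enumerate(sorted(os.listdir(data_folder))):
    wavs = sorted(os.listdir(os.path.join(data_folder, spk)))
    for i, wav in enumerate(wavs):
        rel_path = os.path.join(spk, wav)
        (test_lst if i == 0 else train_lst).append(rel_path)   # e.g. hold out one file per speaker
        lab_dict[rel_path] = spk_index

open('data_lists/my_train.scp', 'w').write('\n'.join(train_lst) + '\n')
open('data_lists/my_test.scp', 'w').write('\n'.join(test_lst) + '\n')
np.save('data_lists/my_labels.npy', lab_dict)
# In the cfg, point the train/test list and label-dictionary options (and data_folder)
# at these files, and set the class layer size to the number of speakers.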

Questions aiming experiments replication

I really appreciate that you released the paper together with the source code. I also have a few questions:

  1. I have plotted "err_te" from the "res.res" file and it seems to be different from Fig. 4 in the paper. Fig. 4 contains values below 0.4% FER, while the "res.res" file has a minimum "err_te" of 0.410. Why?
  2. If "err_te_snt" represents the classification error, that means the classification error is 0.57% after 360 epochs (according to the "res.res" file), while in the paper it is 0.85% (Table 1). Why?
  3. What is the experimental setup for speaker verification?

Thank you!

Re-training the trained SincNet on new dataset

Hi Mirco,
I have a task where I need to take a pre-trained SincNet and re-train it on our data. Both datasets are prepared according to your protocols, and a SincNet trained on the first dataset is available. Now I want to remove the output layer of the trained network and use a new output layer corresponding to the new speakers from dataset 2. Which scripts do I need to modify to do this? Do you think such a strategy is better than combining both datasets and re-training on the composite dataset? In the real world we get more data every few months, and re-training could be time consuming, so I want to test the strategy of initializing from a previously trained network, hoping that it converges faster. Thanks!
