mravanelli / sincnet
SincNet is a neural architecture for efficiently processing raw audio samples.
License: MIT License
I am using your SincNet architecture with the GE2E loss, but the loss value does not decrease. I tried replacing the activation units in the DNN with a logistic function, but that did not help.
Can you give me some advice on why this happens? Or is it simply that this loss function does not work well with SincNet?
How do I create the TIMIT_labels.npy file?
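A minimal sketch of one way to build such a file, assuming it is a pickled dictionary that maps each relative wav path to an integer speaker index (an entry quoted elsewhere on this page, 'train/dr5/fjxm0/sx311.wav': 267, suggests this layout). The function name `build_label_dict`, the output file name, and the rule "the speaker is the folder that contains the wav file" are assumptions for illustration, not the repository's own recipe.

```python
import os
import tempfile

import numpy as np


def build_label_dict(wav_paths):
    """Map each relative wav path to an integer speaker index.

    Assumption: the speaker ID is the name of the folder that directly
    contains the wav file, e.g. 'fjxm0' in 'train/dr5/fjxm0/sx311.wav'.
    """
    speakers = sorted({p.split("/")[-2] for p in wav_paths})
    speaker_to_index = {spk: i for i, spk in enumerate(speakers)}
    return {p: speaker_to_index[p.split("/")[-2]] for p in wav_paths}


wav_paths = [
    "train/dr5/fjxm0/sx311.wav",
    "train/dr5/fjxm0/sx41.wav",
    "train/dr1/fcjf0/si1657.wav",
]
lab_dict = build_label_dict(wav_paths)

# np.save pickles the dict inside a 0-d object array.
out_file = os.path.join(tempfile.gettempdir(), "my_labels.npy")
np.save(out_file, lab_dict)

# .item() unwraps the 0-d array back into a plain dict.
loaded = np.load(out_file, allow_pickle=True).item()
```

In a sketch like this the integer is only a class index: any consistent assignment works as long as every file of one speaker gets the same number and the indices run from 0 to the number of speakers minus one.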
Hi, I am using torch 0.4.0 as mentioned in the README file and I get the following error. Is this because of a version problem, or do I need to install additional dependencies (apart from the ones mentioned in the README)?
Traceback (most recent call last):
  File "speaker_id.py", line 228, in <module>
    pout=DNN2_net(DNN1_net(CNN_net(inp)))
  File "/home/paperspace/anaconda3/envs/sincnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/paperspace/SincNet/dnn_models.py", line 448, in forward
    x = self.drop[i](self.act[i](self.ln[i](F.max_pool1d(torch.abs(self.conv[i](x)), self.cnn_max_pool_len[i]))))
  File "/home/paperspace/anaconda3/envs/sincnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/paperspace/SincNet/dnn_models.py", line 144, in forward
    band_pass_right= torch.flip(band_pass_left,dims=[1])
AttributeError: module 'torch' has no attribute 'flip'
Hi, how would I perform inference for speaker identification using your implementation? For example, how would I get the predicted speaker for the TIMIT wavefile dr3/fjlr0/sa1.wav from your SincNet implementation?
Hello,
First of all, thanks for sharing your work :)
I am new to the field of speech recognition.
I wanted to ask a couple of questions:
1. TIMIT_labels.npy: is there a way to track how the IDs (the particular numbers) are assigned in the dictionary? I loaded the file from the Python command prompt. For example: 'train/dr5/fjxm0/sx311.wav': 267. How is 267 assigned (maybe there is a reference somewhere in the TIMIT dataset)? As I understand it, the speaker ID in this case is fjxm0. So, is 267 important? Or could, for example, all occurrences of 267 in the file be changed to some new value?
2. Can you give some guidelines on how to choose the training and test sets for the model? E.g. what percentage for training and what percentage for testing?
3. File model_raw.pkl: is this the file for the trained model? How can I use it?
4. As I understand it, SincNet solves the speaker identification task. What are the differences compared to the speech recognition task? How can I adapt SincNet for an individual speaker (e.g. to compare results for the same speaker speaking twice)? Maybe for an individual speaker I need to design the test set so that only audio files for that speaker are included?
Looking forward to hearing from you :)
Best regards,
Andrius L.
While running speaker id on my dataset, I get the following error:
Traceback (most recent call last):
File "speaker_id.py", line 227, in <module>
[inp,lab]=create_batches_rnd(batch_size,data_folder,wav_lst_tr,snt_tr,wlen,lab_dict,0.2)
File "speaker_id.py", line 52, in create_batches_rnd
sig_batch[i,:]=signal[snt_beg:snt_end]*rand_amp_arr[i]
ValueError: could not broadcast input array from shape (3200,2) into shape (3200)
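The shape (3200, 2) in this error suggests that the wav files have two channels, while each batch row expects 3200 mono samples. A minimal sketch of one workaround, assuming that averaging the channels is acceptable for the data; `to_mono` is a hypothetical helper, not part of the repository.

```python
import numpy as np


def to_mono(signal):
    """Return a 1-D signal; average the channels if the input is 2-D."""
    signal = np.asarray(signal, dtype=np.float32)
    if signal.ndim == 2:
        return signal.mean(axis=1)
    return signal


# Stand-in for a stereo chunk of shape (3200, 2).
stereo = np.stack([np.ones(3200), np.zeros(3200)], axis=1)
mono = to_mono(stereo)  # shape (3200,)

sig_batch = np.zeros((1, 3200), dtype=np.float32)
sig_batch[0, :] = mono * 0.9  # the assignment now has matching shapes
```

Converting the files to mono once, before training, would avoid repeating the conversion in every batch.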
@mravanelli
I have completed the training with my own dataset. Now I wanted to use the trained model to make predictions with wav files. How do I get the prediction? Can you please help?
Thanks!
Sivam.
Hi sir, can you please share the pre-trained model or its weights? :)
Hi,
I've been able to reproduce to a very close degree the results of the TIMIT experiment. However, I believe to reproduce the LibriSpeech results, I'll need a bit more information if you don't mind. I've currently downloaded the clean 100 and 360 hour datasets as well as the 500 hour "other" dataset. This has about 100 fewer speakers than the number you reference in your paper. Could you provide the names of the Libri datasets from which you drew your speakers?
How did you preprocess them? I know that in the paper you mention using only 12-15s for training. Did you just take the first 15s of each utterance (or less if the utterance was shorter)? If not, could you explain how you arrived at the training utterances, as well as the preprocessing you've done?
Could you also provide the training/testing data lists similar to TIMIT_train.scp and TIMIT_test.scp?
Again, this is aimed at trying to reproduce the Libri results, so as much specificity as possible would be great!
Thanks for all you've done!
First, thanks for your contribution.
In your experiments, you report the Classification Error Rate (CER%) for the speaker-id task and the Equal Error Rate (EER%) for speaker verification. However, speaker recognition is currently done with deep feature representations and similarity scores, so could you explain how well SincNet generalizes when it is trained on dataset A and tested on dataset B?
Traceback (most recent call last):
  File "/home/administrator/Videos/SincNet-master/TIMIT_preparation.py", line 45, in <module>
    in_folder = sys.argv[1]
IndexError: list index out of range
Please suggest a solution to the error above.
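This IndexError means that `sys.argv` held fewer entries than the script reads, i.e. it was started without all of its command-line arguments. A minimal sketch of a guard that reports this more clearly, assuming the three-argument form (input folder, output folder, list file) shown in another post on this page; `parse_args` and the example paths are hypothetical.

```python
import sys


def parse_args(argv):
    """Return (in_folder, out_folder, list_file) or exit with a usage line."""
    if len(argv) != 4:
        raise SystemExit(
            "usage: python TIMIT_preparation.py IN_FOLDER OUT_FOLDER LIST_FILE"
        )
    return argv[1], argv[2], argv[3]


# Example call with made-up paths; a real script would pass sys.argv.
args = parse_args([
    "TIMIT_preparation.py",
    "/data/TIMIT",
    "/data/TIMIT_norm",
    "data_lists/TIMIT_all.scp",
])
```

When the arguments come from shell variables, it is worth checking that the variables are actually set, since an unset variable expands to nothing and silently shortens the argument list.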
Just curious if you ever tested using a single optimizer for all the networks, and if so, was the performance worse? Similarly, have you run any experiments varying the learning rate or other optimizer parameters per layer?
Thanks!
Sean
Hello,
I can run this code successfully with only one GPU, but it fails when I try to use DataParallel to use several GPUs. How can I use several GPUs to speed up training?
The main code is:
CNN_net=CNN(CNN_arch)
CNN_net.cuda()
CNN_net=nn.DataParallel(CNN_net)
DNN1_net=MLP(DNN1_arch)
DNN1_net.cuda()
DNN1_net=nn.DataParallel(DNN1_net)
DNN2_net=MLP(DNN2_arch)
DNN2_net.cuda()
DNN2_net=nn.DataParallel(DNN2_net)
The error is:
Traceback (most recent call last):
  File "speaker_id.py", line 180, in <module>
    DNN1_arch = {'input_dim': CNN_net.out_dim,
  File "/Work19/2017/xxx/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 535, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'out_dim'
Dear Mirco,
I couldn't find the link to arxiv paper. When will it be available?
Thanks.
as described above
How can I download the model_raw.pkl that was used (SincNet/exp/SincNet_TIMIT/model_raw.pkl)?
Hello,
First of all, thanks for all the great work :)
I managed to reproduce the results from the paper using the TIMIT dataset and I am now thinking about the following scenario:
I have a dataset of 500 speakers, I trained the model on it, I get a good enough accuracy and the model can reliably identify one of those 500 speakers from an audio sample. Now I need to add one or more new speakers, let's say 5; the desired outcome is a model that can identify one of the now 505 speakers. This could be a case that repeats in the future, as I get more audio data.
I currently have these approaches in mind:
1. Train the model from scratch every time I need to add new speakers. The disadvantage to this is that I don't leverage any accumulated knowledge from previous trainings.
2. Use transfer learning somehow - load the weights from the "500 speakers" trained model and replace the softmax layer with one that has 505 classes, then train a few more epochs.
3. Same as 2, except we also freeze all the layers except softmax.
How would you approach this? If 2 and 3 are viable options, how would you implement that? Would changing "class_lay" in the config to 505 and training with the new dataset be enough for 2? How would you approach freezing the non-softmax layers?
Thanks again,
Bogdan
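A minimal NumPy sketch of the idea behind option 2: keep the 500 trained output rows and append freshly initialised rows for the new speakers. It is framework-agnostic and only illustrates the weight bookkeeping. The function name, the feature size of 2048 and the initialisation scale are assumptions, and it does not describe how the repository handles "class_lay".

```python
import numpy as np


def extend_output_layer(weight, bias, n_new, rng):
    """Append n_new freshly initialised rows to a (classes, features) layer."""
    n_feat = weight.shape[1]
    new_w = rng.normal(0.0, 0.01, size=(n_new, n_feat)).astype(weight.dtype)
    new_b = np.zeros(n_new, dtype=bias.dtype)
    return np.concatenate([weight, new_w]), np.concatenate([bias, new_b])


rng = np.random.default_rng(0)
old_w = rng.normal(size=(500, 2048)).astype(np.float32)  # stand-in for trained weights
old_b = np.zeros(500, dtype=np.float32)

new_w, new_b = extend_output_layer(old_w, old_b, n_new=5, rng=rng)
# Rows 0..499 are unchanged; rows 500..504 belong to the new speakers.
```

For option 3 in PyTorch, freezing usually amounts to setting `requires_grad = False` on the parameters of the layers that should stay fixed and giving the optimizer only the remaining parameters.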
I got this error while doing d-vector extraction. Is the d-vector the output of the 2nd CNN layer? The error seems to be related to having the GPU version of Torch rather than the CPU version. How did you handle this in your case? Thanks! The error message is below:
    import torch
  File "anaconda3/lib/python3.6/site-packages/torch/__init__.py", line 80, in <module>
    from torch._C import *
ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
After using the first CNN layer (the sinc layer) to process the waveform, is it possible to convert the extracted features back to the original waveform? Thanks
Hi Mirco,
I'm curious if you had to make any adjustments to the structure of the model to handle the >2k speakers in LibriSpeech? I've been attempting to fit a model on >3k speakers, and the Sentence Classification Error doesn't drop below ~50%. I have ~3min of speech per speaker.
Could you provide the res.res file for LibriSpeech that you provided for TIMIT?
Hello, I have the model file model_raw.pkl. Where is SincNet implemented?
As promised -- an update
Moved from mravanelli/pytorch-kaldi#88 to here at SincNet
On pytorch-kaldi #88, I stated:
I am going to attempt to run the SincNet speaker id experiment.
If it fails to run I will look into more hardware. So ...
Attempt to install on Raspberry Pi 3B+
Requirements:
Linux
Python 3.6/2.7
pytorch 1.0
pysoundfile
anaconda
Bottom line toward Requirements:
(I am skipping the installation output except for the last line)
Linux: Raspbian GNU/Linux 9 (stretch) *** YES ***
Python 3.6/2.7 *** NO *** I have Python 3.5.3 -- Anaconda installed 3.4
pi@raspberrypi:~ $ conda install anaconda-client
Anaconda *** Maybe *** The following NEW packages will be INSTALLED:
anaconda-client: 1.0.2-py34_0
clyent: 0.4.0-py34_0
freetype: 2.5.2-2
jpeg: 8d-0
libpng: 1.6.17-0
libtiff: 4.0.2-1
pillow: 2.9.0-py34_0
pip: 7.1.2-py34_0
python-dateutil: 2.4.2-py34_0
pytz: 2015.4-py34_0
setuptools: 18.1-py34_0
six: 1.9.0-py34_0
wheel: 0.24.0-py34_0
pi@raspberrypi:~ $ sudo apt-get install pytorch
pytorch E: Unable to locate package pytorch
pi@raspberrypi:~ $ pip3 install pytorch
Exception: You tried to install "pytorch". The package named for PyTorch is "torch"
pi@raspberrypi:~ $ sudo apt-get install torch
RuntimeError: PyTorch does not currently provide packages for PyPI (see status at pytorch/pytorch#566).
Please follow the instructions at http://pytorch.org/ to install with miniconda instead.
pi@raspberrypi:~ $ conda install pytorch=0.4.1 -c pytorch
pytorch *** NO *** Error: No packages found in current linux-armv7l channels matching: pytorch 0.4.1*
pysoundfile ( conda install -c conda-forge pysoundfile)
pi@raspberrypi:~ $ conda install -c conda-forge pysoundfile
Fetching package metadata: ......
Solving package specifications:
Error: Could not find some dependencies for pysoundfile: cffi
pi@raspberrypi:~ $ conda install --channel https://conda.anaconda.org/poppy-project cffi
The following packages conflict with each other:
cffi
python 3.4*
pysoundfile *** NO ***
Toward Requirements: 1.5 out of 5
************************* PLAN B *************************
Order a new system
https://developer.nvidia.com/embedded/buy/jetson-nano-devkit $99
NVIDIA® Jetson Nano™ Developer Kit is a small, powerful computer that lets you run multiple neural networks in parallel for applications like image classification, object detection, segmentation, and speech processing.
JetPack is compatible with NVIDIA’s world-leading AI platform for training and deploying AI software, and reduces complexity and effort for developers by supporting many popular AI frameworks, like TensorFlow, PyTorch, Caffe, and MXNet. It also includes a full desktop Linux environment and out-of-the-box support for a variety of popular peripherals, add-ons, and ready-to-use projects.
Technical Specifications
GPU 128-core Maxwell
CPU Quad-core ARM A57 @ 1.43 GHz
Memory 4 GB 64-bit LPDDR4 25.6 GB/s
Storage microSD (not included)
Video Encode 4K @ 30 | 4x 1080p @ 30 | 9x 720p @ 30 (H.264/H.265)
Video Decode 4K @ 60 | 2x 4K @ 30 | 8x 1080p @ 30 | 18x 720p @ 30 (H.264/H.265)
Camera 1x MIPI CSI-2 DPHY lanes
Connectivity Gigabit Ethernet, M.2 Key E
Display HDMI 2.0 and eDP 1.4
USB 4x USB 3.0, USB 2.0 Micro-B
Others GPIO, I2C, I2S, SPI, UART
Mechanical 100 mm x 80 mm x 29 mm
https://www.adafruit.com/product/1995 $8
5V 2.4A Switching Power Supply
https://www.sandisk.com/home/memory-cards
SanDisk microSDHC™ 64GB SD Card $13
Existing USB Keyboard, USB Mouse, HDMI screen
Toward Requirements:
Linux: Ubuntu 18.04 LTS
Python 3.6
PyTorch 1.1
Watched a tutorial on configuring, booting up, and running some examples.
And another on Introduction to Deep Learning
Unknown out of 5 (I will update you in a week / 10 days)
Table 2 of the paper compares the performance on the speaker verification task between SincNet with d-vectors and several other models. How is the d-vector extracted? Which of the following is correct?
1. The naive way. A random chunk (200 ms, not necessarily from the same utterance) of the speaker's audio is pre-processed (\x -> x / abs(max(y)), where x is the current chunk and y is the whole signal) and fed to the network, and the output of the last hidden layer (i.e. the output of DNN1_net) is taken. Doing the same for other audio chunks (so many chunks are consumed) gives many vectors, denoted d_1, d_2, ..., d_n. Each d_i is L2-normalized to get d'_i, and the final d-vector of the speaker is mean(d'_1, d'_2, ..., d'_n).
2. The "original" way. A speaker is represented by a sequence of utterances, {O_i: i}, and each utterance consists of a sequence of frames, {o_j: j}. During enrollment/verification, each o_j is accompanied by its context (the original paper uses 30 frames to the left and 10 frames to the right) and fed to the network. The output of the last hidden layer is extracted, denoted a_j, and a_j is then L2-normalized. The d-vector of utterance O_i is sum_{j} {a_j}; the d-vector of the speaker is mean({d-vector of O_i: i}).
The d-vector paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41939.pdf
PS. In both cases, I assume that the network used for extracting the d-vector is the same one that is produced by speaker_id.py.
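A minimal sketch of the "naive" variant described in this post, shown only to make the arithmetic concrete: L2-normalise each chunk embedding, then average. It makes no claim about which variant the paper used. `d_vector` and the toy 2-dimensional embeddings are made up for illustration.

```python
import numpy as np


def d_vector(chunk_embeddings, eps=1e-12):
    """L2-normalise each row, then average the rows into one vector."""
    emb = np.asarray(chunk_embeddings, dtype=np.float64)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return (emb / np.maximum(norms, eps)).mean(axis=0)


# Two toy chunk embeddings: d_1 = (3, 4) and d_2 = (0, 2).
chunks = np.array([[3.0, 4.0], [0.0, 2.0]])
dvec = d_vector(chunks)  # mean of (0.6, 0.8) and (0.0, 1.0)
```

Normalising before averaging keeps a few high-energy chunks from dominating the speaker vector.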
My dataset is very large, and GPU utilization (24 GB card) is very low during training: a batch size of 128 uses only 1 GB of the 24 GB. I tried increasing the batch size to improve utilization, but I worry about the impact on convergence. Is there a way to improve GPU utilization? Thank you.
I believe what you call LayerNorm is actually InstanceNorm1d in PyTorch.
Lines 112 to 123 in 488c982
It is my understanding that Layer Normalization would actually have one weight/bias per sample in the sequence, while Instance Normalization only has one per channel. Do you confirm?
Hello, I've been reading your paper and I'm a little curious about the calculation of mutual information, in particular when the MI is rewritten in its KL-based form, which involves a joint distribution and the product of the two marginals.
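For reference, the standard identity this question appears to refer to, stated for discrete variables. The symbols X, Y and p are generic and are not taken from the paper.

```latex
I(X;Y) \;=\; D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big)
       \;=\; \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
```

The joint distribution is p(x,y) and the "two marginals" enter as the product p(x)p(y), so the MI is zero exactly when X and Y are independent.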
Hi, I am trying to check the performance of SincNet on VoxCeleb dataset. I am wondering about the rationale of extracting chunks of 200ms windows of signals during training, and also the 10ms overlap that you have in test? Does the model depend on this?
Can I use longer chunks like 3s of audio like the VoxCeleb paper seems to be using? Seeing that VoxCeleb is a much larger dataset?
@mravanelli
Wanted to know if there is a port of the net on TF?
Regards,
Sivam
@mravanelli Sir, is it possible to pass each speaker's x-vectors into SincNet and classify the speaker, as in speaker verification?
I have my speakers' x-vector embeddings (generated by the Kaldi benchmark pretrained model). How can I apply SincNet to a speaker verification task using them?
I am a beginner at this and don't know how to do it.
Any ideas or suggestions would be welcome.
Thank you, sir.
Hello,
I want to find out more details about the TIMIT database (in particular the .TXT, .PHN and .WRD files).
For example, in folder train/dr1/FCJF0:
In file SI1657.TXT I have the following:
0 45466 Or borrow some money from someone and go home by bus?
Question: What do the numbers '0' and '45466' refer to? Perhaps a duration in milliseconds?
File SI1657.WRD :
2120 3533 or
3533 8200 borrow
8200 12291 some
12291 15325 money
15325 18435 from
18435 25984 someone
25984 28960 and
28960 31000 go
31000 34599 home
34599 36200 by
36200 43480 bus
Question: What do the numbers in the first two columns refer to?
File SI1657.PHN (took a fragment) :
0 2120 h#
2120 2725 q
Question: What do the numbers (0, 2120 and 2120, 2725) refer to?
Another question: Would SincNet work if no .phn (phonetic) files are provided with the dataset?
Best regards,
Andrius L.
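To my knowledge, the numbers in the TIMIT .TXT, .WRD and .PHN files are begin and end sample indices rather than milliseconds, and the corpus is sampled at 16 kHz. Under that assumption, a minimal sketch that converts such lines to seconds (`parse_alignment` is a hypothetical helper):

```python
SAMPLE_RATE = 16000  # assumed TIMIT sampling rate in Hz


def parse_alignment(lines, sample_rate=SAMPLE_RATE):
    """Turn 'begin end label' lines into (start_s, end_s, label) tuples."""
    rows = []
    for line in lines:
        begin, end, label = line.split(maxsplit=2)
        rows.append((int(begin) / sample_rate, int(end) / sample_rate, label))
    return rows


# The first two lines of SI1657.WRD quoted above.
wrd = ["2120 3533 or", "3533 8200 borrow"]
rows = parse_alignment(wrd)
```

With this reading, the 45466 in the .TXT file corresponds to an utterance of about 2.84 seconds.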
Hi,
First of all, thanks for sharing your work :)
I use Google Colab to execute this project.
when i run ''!python3 "/content/keras-sincnet/speaker_id.py" --cfg=/content/keras-sincnet/cfg/SincNet_TIMIT.cfg''
I get the following error:
File "/content/keras-sincnet/speaker_id.py", line 227, in
[inp,lab]=create_batches_rnd(batch_size,data_folder,wav_lst_tr,snt_tr,wlen,lab_dict,0.2)
File "/content/keras-sincnet/speaker_id.py", line 53, in create_batches_rnd
lab_batch[i]=lab_dict[wav_lst[snt_id_arr[i]]]
KeyError: 'TRAIN/DR1/FSJK1/SX305.WAV'
Can you give some guidance on how to resolve this error?
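One possible cause, offered as an assumption: the list file contains upper-case paths while the label dictionary uses lower-case keys (an entry quoted elsewhere on this page is written 'train/dr5/fjxm0/sx311.wav'). A minimal sketch that normalises both sides before the lookup; `normalize_key` and the label value 12 are made up for illustration.

```python
def normalize_key(path):
    """Lower-case a path and use forward slashes so lookups are consistent."""
    return path.replace("\\", "/").lower()


lab_dict = {"train/dr1/fsjk1/sx305.wav": 12}  # hypothetical label entry
wav_lst = ["TRAIN/DR1/FSJK1/SX305.WAV"]       # path as it appears in the error

lab_dict_norm = {normalize_key(k): v for k, v in lab_dict.items()}
labels = [lab_dict_norm[normalize_key(p)] for p in wav_lst]
```

This only fixes the dictionary lookup. The paths used to open the wav files still have to match the case of the files on disk.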
File "/home/administrator/Videos/SincNet-master/TIMIT_preparation.py", line 51, in
copy_folder(in_folder,out_folder)
File "/home/administrator/Videos/SincNet-master/TIMIT_preparation.py", line 36, in copy_folder
shutil.copytree(in_folder, out_folder, ignore=ig_f)
File "/home/administrator/Music/dell/lib/python3.5/shutil.py", line 303, in copytree
names = os.listdir(src)
NotADirectoryError: [Errno 20] Not a directory: '/home/administrator/Videos/SincNet-master/TIMIT_preparation.py'
I need a solution for the error above.
Thank you for your contribution.
In the paper you report speaker verification performance, but I could not find any code related to speaker verification in the repository. Could you please explain how I can implement verification?
I got this error after 1st epoch. Please help I am confused here.
Thanks in advance
Archan
Hi, the trained model does not seem to work with compute_d_vector:
Missing key(s) in state_dict: "conv.0.low_hz_", "conv.0.band_hz_"
the keys in checkpoint_load['CNN_model_par']:
conv.0.filt_b1
conv.0.filt_band
conv.1.weight
conv.1.bias
conv.2.weight
conv.2.bias
bn.0.weight
bn.0.bias
bn.0.running_mean
bn.0.running_var
bn.1.weight
bn.1.bias
bn.1.running_mean
bn.1.running_var
bn.2.weight
bn.2.bias
bn.2.running_mean
bn.2.running_var
ln.0.gamma
ln.0.beta
ln.1.gamma
ln.1.beta
ln.2.gamma
ln.2.beta
ln0.gamma
ln0.beta
Thanks a lot.
Thanks for this great toolkit!
I've trained the speaker-id model and I'm now trying to extract the d-vectors for various wav files.
I notice that if I print the final utterance d-vector, I get different vector values each time:
For example, on the first run, I get:
[0.00486447 0.01101663 0.00926225 ... 0.0081592 0.00329391 0.01262286]
And the 2nd time I start the script, I get
[-6.3974774e-05 -1.0634900e-02 -1.0657464e-04 ... 1.9263949e-02
-2.3915395e-03 4.1378587e-02]
And again:
[0.00949155 0.01689023 0.00099393 ... 0.01041446 0.01067957 0.01114707]
If I put the same audio file twice in the wav list, I get consistent values within a run, but always different values across runs.
Any clues? I get this either running on the GPU or the CPU.
Not an issue, but just wanted to post that you can further decrease the sentence classification error by reverberating each training call and including them in the training dataset (effectively doubling the training size). The error drops to 0%. I will try with Libri as well
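A minimal sketch of this kind of augmentation: convolve each training signal with a room impulse response and keep both the dry and the reverberated copy. The synthetic decaying-noise impulse response and the names below are assumptions. The post does not say which impulse responses were used.

```python
import numpy as np


def add_reverb(signal, impulse_response):
    """Convolve with an impulse response, keep the length, rescale to peak 1."""
    wet = np.convolve(signal, impulse_response)[: len(signal)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


fs = 16000
rng = np.random.default_rng(0)

# Synthetic impulse response: 0.2 s of exponentially decaying noise (assumption).
t = np.arange(int(0.2 * fs)) / fs
ir = rng.normal(size=t.size) * np.exp(-t / 0.05)
ir[0] = 1.0  # direct path

dry = rng.normal(size=fs)    # stand-in for one second of speech
wet = add_reverb(dry, ir)
augmented_set = [dry, wet]   # original plus reverberated copy
```

Measured room impulse responses would be the more realistic choice where they are available.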
torch.abs is applied to the output of sinc-based convolutions.
Lines 312 to 313 in 488c982
Why is that? I couldn't find an explanation in the paper.
Hi Mirco,
thanks for sharing this great work! I'm trying to visualize the weights of the sinc layer in time and frequency domain but I'm having trouble getting it right. Some filters don't look like a single bandpass but have different amplitudes for different frequencies (e.g., see the right example in the eleventh row in the figure below).
The code below just loads a trained model from the saved checkpoint, computes the filters in time domain, and visualizes them alongside their Fourier transform. I'm certain it's just a problem with visualizing them. If you could have a look or share the example from Figure 2 in the paper, that would be highly appreciated.
Thanks,
Benedikt
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import math
import torch

sampling_frequency = 16000
num_filters = 80
filter_length = 251
frequency_scaling = sampling_frequency * 1.0

# Load trained model from checkpoint
torch_checkpoint = torch.load("exp/SincNet_TIMIT/model_raw.pkl", map_location='cpu')
low_hz = torch_checkpoint["CNN_model_par"]["conv.0.filt_b1"]
band_hz = torch_checkpoint["CNN_model_par"]["conv.0.filt_band"]

# Compute filter kernel. Apart from renaming some variables, this should be the same as the original code.
def flip(x, dim):
    xsize = x.size()
    dim = x.dim() + dim if dim < 0 else dim
    x = x.contiguous()
    x = x.view(-1, *xsize[dim:])
    x = x.view(x.size(0), x.size(1), -1)[:,
        getattr(torch.arange(x.size(1) - 1, -1, -1), ('cpu', 'cuda')[x.is_cuda])().long(), :]
    return x.view(xsize)

def sinc(band, t_right):
    y_right = torch.sin(2 * math.pi * band * t_right) / (2 * math.pi * band * t_right)
    y_left = flip(y_right, 0)
    y = torch.cat([y_left, torch.autograd.Variable(torch.ones(1)), y_right])
    return y

filters = torch.autograd.Variable(torch.zeros((num_filters, filter_length)))
N = filter_length
t_right = torch.autograd.Variable(torch.linspace(1, (N - 1) / 2, steps=int((N - 1) / 2)) / sampling_frequency)
min_freq = 50.0
min_band = 50.0
filt_beg_freq = torch.abs(low_hz) + min_freq / frequency_scaling
filt_end_freq = filt_beg_freq + (torch.abs(band_hz) + min_band / frequency_scaling)
n = torch.linspace(0, N, steps=N)

# Filter window (hamming)
window = 0.54 - 0.46 * torch.cos(2 * math.pi * n / N)
window = torch.autograd.Variable(window.float())

for i in range(num_filters):
    low_pass1 = 2 * filt_beg_freq[i].float() * sinc(filt_beg_freq[i].float() * frequency_scaling, t_right)
    low_pass2 = 2 * filt_end_freq[i].float() * sinc(filt_end_freq[i].float() * frequency_scaling, t_right)
    band_pass = (low_pass2 - low_pass1)
    band_pass = band_pass / torch.max(band_pass)
    filters[i, :] = band_pass * window

filters = filters.view(num_filters, 1, filter_length)
filters = filters.detach().numpy()

# Visualize filter kernels (similar to https://gist.github.com/endolith/236567)
# Two filters and their Fourier transform per row.
num_cols = 4
num_rows = int(np.ceil(num_filters * 2 / num_cols))
fig, axes = plt.subplots(num_rows, num_cols, figsize=(9, 80))
for i in range(num_filters):
    spatial_ax = axes[(i * 2) // 4, (i * 2) % 4]
    frequency_ax = axes[(i * 2 + 1) // 4, (i * 2 + 1) % 4]
    weights = filters[i, 0, :]
    # Frequency computation
    ampl = 1/N * np.abs(np.fft.rfft(weights))
    # RFFT frequency bins
    freqs = np.fft.rfftfreq(N, 1/sampling_frequency)
    spatial_ax.plot(weights)
    frequency_ax.stem(freqs, ampl)
fig.tight_layout()
Hi Mirco, many thanks for the great work and for sharing the code! It's super useful for the work I am doing. For the speaker identification task, each training sample has a length of 3200, since fs=16000; cw_len=200; wlen=int(fs*cw_len/1000.00)=3200. For testing, voting over even smaller chunks is performed.
I have a speech classification task for which I think it is best to take the whole speech sequence as input, without chunking. Is it currently possible to use variable-length input sequences with SincNet at all? If not, I would pad each batch with zeros to the maximum length in that batch. Would SincNet be affected by padded batches?
Oh, by the way, my speech sequences vary in length a lot. What would be your suggestion, Mirco? Many thanks again!
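A minimal sketch of the two operations mentioned in this post: the chunk-length arithmetic and zero-padding a batch to its longest sequence. `pad_batch` is a hypothetical helper, and the sketch does not show whether the network accepts the padded input.

```python
import numpy as np

fs = 16000
cw_len = 200                      # chunk length in ms
wlen = int(fs * cw_len / 1000.0)  # 3200 samples per training chunk


def pad_batch(signals):
    """Zero-pad 1-D signals to the longest one; also return the true lengths."""
    lengths = np.array([len(s) for s in signals])
    batch = np.zeros((len(signals), lengths.max()), dtype=np.float32)
    for i, s in enumerate(signals):
        batch[i, : len(s)] = s
    return batch, lengths


# Two toy "utterances" of 5 and 3 samples.
batch, lengths = pad_batch([np.ones(5), np.ones(3)])
```

Keeping the true lengths next to the padded batch makes it possible to mask or ignore the padded region later, for example when pooling over time.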
I configured the following which I believe will allow me to load the software I need and run SincNet. The manufacturer said it would run but questioned why I bought so much memory and only one monitor. If you see holes that need plugging, I would appreciate any comments.
Thank you, Gerard
Workstation 7920
Intel Xeon Gold 6140 2.3GHz, 3.7GHz Turbo 18C, 10.4GT/s 3UPI, 25MB Cache, HT (140W) DDR4-2666
Ubuntu Linux 18.04
NVIDIA Quadro GV100, 32GB, 4 DP (Precision xx20 Towers)
384GB 12x32GB DDR4 2666MHz RDIMM ECC
M.2 256GB PCIe NVMe Class 40 Solid State Boot Drive
M.2 2TB PCIe NVMe Class 40 Solid State Drive
Keyboard, Mouse, Monitor
Ethernet LAN
I really appreciate you released the paper together with the source code. Also, I have a question on the performance gap between err_tr and err_te:
According to the result shown in res.res, after epoch 360, err_tr=0.009600, err_te=0.419954. The gap between training and validation's performance seems quite large. Does it mean the model suffers some kind of overfitting problem?
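An error measured on individual short chunks can be much higher than a sentence-level error obtained by voting over chunks, which another post on this page describes for testing. The sketch below only illustrates the difference between the two measures on made-up posteriors. It does not claim which measure err_te reports, and `chunk_and_sentence_error` is a hypothetical helper.

```python
import numpy as np


def chunk_and_sentence_error(posteriors, label):
    """posteriors: (n_chunks, n_classes) for one sentence of speaker `label`."""
    chunk_pred = posteriors.argmax(axis=1)
    chunk_err = float(np.mean(chunk_pred != label))
    # Sentence decision: sum the posteriors over chunks, then take the argmax.
    sentence_pred = int(posteriors.sum(axis=0).argmax())
    return chunk_err, float(sentence_pred != label)


# Four chunks of one sentence spoken by class 0 (made-up numbers).
post = np.array([
    [0.90, 0.10],
    [0.40, 0.60],
    [0.45, 0.55],
    [0.80, 0.20],
])
chunk_err, snt_err = chunk_and_sentence_error(post, label=0)
```

Here half of the chunks are misclassified, yet the summed posteriors still favour the correct speaker, so the sentence-level error is zero.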
Hi Mirco,
Thanks for the great work! I was wondering if you plan to share the data preparation recipe for voxCeleb1 and librispeech that can allow us to reproduce other experiments from your paper.
Hello, when I was doing this step, the following problem occurred:
$ python TIMIT_preparation.py $TIMIT_FOLDER $OUTPUT_FOLDER data_lists/TIMIT_all.scp
Traceback (most recent call last):
  File "TIMIT_preparation.py", line 41, in <module>
    out_folder=sys.argv[2]
IndexError: list index out of range
I hope you can give me some answers, thank you very much.
Hello!
Firstly, thanks for sharing the code of your paper, it's really a fantastic work!
But I'm quite confused about how to test my own model. When testing the model on speaker identification, we should divide the test set into two parts: some of the data is used for enrollment and the rest is used for testing.
For example, if there are ten sentences for each speaker, it may not be appropriate to use nine of them for enrollment and one for testing, because the model may learn a lot from the nine sentences and then easily predict the remaining one correctly. In that case the accuracy might be higher than it truly should be, which is not what I want.
I read your code carefully but didn't find the answer, sorry about that. :(
So could you please tell me how to divide the test set?
Hi,
I wanted to run speaker_id with our own database. We will have train and test wav files. How do we go about doing this?
Thanks,
Sivam.
I really appreciate you released the paper together with the source code. Also, I have few questions:
Thank you!
Hi Mirco,
I have a task where I need to take a pre-trained SincNet and re-train it on our data. Both datasets are prepared according to your protocols, and the SincNet trained on the first dataset is available as well. Now I want to remove the output layer of the trained network and use a new output layer corresponding to the new speakers from dataset 2. Which scripts do I need to modify to do this? Do you think this strategy is better than combining both datasets and re-training on the composite dataset? In the real world we get more data every few months, and re-training could be time-consuming. So I want to test the strategy of initializing from a previously trained network, hoping that it converges faster. Thanks!