
3d-convolutional-speaker-recognition-pytorch's Introduction

3D Convolutional Neural Networks for Speaker Verification - Official Project Page


This repository contains the PyTorch code release for our paper titled "Text-Independent Speaker Verification Using 3D Convolutional Neural Networks". The link to the paper is provided as well.

The code has been developed using PyTorch. The input pipeline must be prepared by the user. This code aims to provide the implementation of Speaker Verification (SV) using 3D convolutional neural networks, following the SV protocol.

Citation

If you use this code, please kindly consider citing the following paper:

@article{torfi2017text,
  title={Text-independent speaker verification using 3d convolutional neural networks},
  author={Torfi, Amirsina and Nasrabadi, Nasser M and Dawson, Jeremy},
  journal={arXiv preprint arXiv:1705.09422},
  year={2017}
}

General View

We leveraged a 3D convolutional architecture to create the speaker model, simultaneously capturing the speech-related and temporal information from the speakers' utterances.

Speaker Verification Protocol (SVP)

In this work, a 3D Convolutional Neural Network (3D-CNN) architecture has been utilized for text-independent speaker verification in three phases.

1. In the development phase, a CNN is trained to classify speakers at the utterance level.

2. In the enrollment stage, the trained network is utilized to directly create a speaker model for each speaker based on the extracted features.

3. Finally, in the evaluation phase, the extracted features from the test utterance will be compared to the stored speaker model to verify the claimed identity.

The aforementioned three phases usually constitute the SV protocol. One of the main challenges is the creation of the speaker models. Previously reported approaches create speaker models by averaging the extracted features from a speaker's utterances, which is known as the d-vector system.
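
As a rough illustration of the enrollment and evaluation steps, the d-vector-style baseline can be sketched as follows (a minimal NumPy sketch, assuming utterance-level embeddings have already been extracted by the trained network; the function names and the threshold value are illustrative placeholders, not the repository's code):

import numpy as np

def create_speaker_model(utterance_embeddings):
    # Enrollment (d-vector style): average the embeddings extracted from the
    # speaker's enrollment utterances to form a single speaker model.
    return np.mean(utterance_embeddings, axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(test_embedding, speaker_model, threshold=0.5):
    # Evaluation: accept the claimed identity when the similarity between the
    # test embedding and the stored speaker model exceeds a decision threshold.
    return cosine_similarity(test_embedding, speaker_model) >= threshold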

How to leverage 3D Convolutional Neural Networks?

In our paper, we propose 3D-CNNs for direct speaker model creation: for both the development and enrollment phases, an identical number of speaker utterances is fed to the network to represent the spoken utterances and create the speaker model. This leads to simultaneously capturing the speaker-related information and building a system that is more robust to within-speaker variation. We demonstrate that the proposed method significantly outperforms the d-vector verification system.

Dataset

Unlike the original implementation, here we used the publicly available VoxCeleb dataset. The dataset contains annotated audio files. For speaker verification, however, the parts of the audio associated with the subject of interest must be extracted from the raw audio files.

Three steps should be taken to prepare the data after downloading the associated files.

  1. Extract the specific audio part where the subject of interest is speaking. [extract_audio.py]
  2. Create the train/test phases. [create_phases.py]
  3. Apply Voice Activity Detection (VAD) to remove silence. [vad.py]

Creation of the dataset object, the necessary preprocessing, and the feature extraction are performed in the following data class:

import os

import scipy.io.wavfile as wav


class AudioDataset():
    """Audio dataset."""

    def __init__(self, files_path, audio_dir, transform=None):
        """
        Args:
            files_path (string): Path to the .txt file in which the addresses
                of the audio files are saved.
            audio_dir (string): Directory with all the audio files.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.audio_dir = audio_dir
        self.transform = transform

        # Open the .txt file and create a list from each line
        # (whitespace characters such as `\n` are stripped from each line).
        with open(files_path, 'r') as f:
            content = f.readlines()

        # Keep only healthy files: the RIFF header size must match the file
        # size and the file must be larger than 1000 bytes.
        list_files = []
        for x in content:
            sound_file_path = os.path.join(self.audio_dir, x.strip().split()[1])
            try:
                with open(sound_file_path, 'rb') as f:
                    riff_size, _ = wav._read_riff_chunk(f)
                    file_size = os.path.getsize(sound_file_path)

                # Raises an AssertionError for corrupted or truncated files.
                assert riff_size == file_size and os.path.getsize(sound_file_path) > 1000, "Bad file!"

                # Add to the list if the file is OK!
                list_files.append(x.strip())
            except Exception:
                print('file %s is corrupted!' % sound_file_path)

        # Save the correct and healthy sound files to a list.
        self.sound_files = list_files

    def __len__(self):
        return len(self.sound_files)

    def __getitem__(self, idx):
        # Get the sound file path
        sound_file_path = os.path.join(self.audio_dir, self.sound_files[idx].split()[1])
        # Reading the .wav file and the feature extraction follow.
        ...
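
A minimal usage sketch with a PyTorch DataLoader might look like the following (assuming the complete __getitem__, which returns a {'feature': ..., 'label': ...} sample as in the data provider quoted further down this page; the file paths are hypothetical and a transform yielding fixed-size feature cubes is assumed so that the default collation can batch the samples):

import torch

# Hypothetical paths: the list produced by create_phases.py and the directory
# holding the VAD-processed .wav files.
train_dataset = AudioDataset(files_path='data/train.txt',
                             audio_dir='data/wav',
                             transform=None)  # replace with a fixed-size feature transform

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64,
                                           shuffle=True, num_workers=2)

for batch in train_loader:
    features, labels = batch['feature'], batch['label']
    # features: batched MFEC feature maps, labels: speaker indices
    break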

Code Implementation

The input pipeline must be provided by the user. Please refer to ``code/0-input/input_feature.py`` to get an idea of how the input pipeline works.

Input Pipeline for this work

MFCC features can be used as the frame-level data representation of the spoken utterances. However, a drawback is their non-local characteristics due to the final DCT operation used to generate MFCCs. This operation disturbs the locality property and is in contrast with the local nature of convolutional operations. The approach employed in this work is to use the log-energies, which we call MFECs. MFEC extraction is identical to MFCC extraction except that the DCT operation is discarded. The temporal features are overlapping 20 ms windows with a stride of 10 ms, which are used to generate the spectral features. From a 0.8-second sound sample, 80 temporal feature sets (each comprising 40 MFEC features) can be obtained, which form the input speech feature map. Each input feature map has dimensionality ζ × 80 × 40, formed from 80 input frames and their corresponding spectral features, where ζ is the number of utterances used in modeling the speaker during the development and enrollment stages.

The speech features have been extracted using [SpeechPy] package.
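
As an illustration, the MFEC features of a single utterance can be extracted with SpeechPy roughly as follows (a sketch based on the speechpy.feature.lmfe call quoted in the data provider further down this page; the file path is hypothetical and the utterance is assumed to be at least 0.8 seconds long):

import soundfile as sf
import speechpy

# Illustrative path: any VAD-processed .wav file from the prepared dataset.
signal, fs = sf.read('data/wav/sample_utterance.wav')

# 40 log-energy (MFEC) features per frame, as in the data provider class.
logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                                  frame_length=0.025, frame_stride=0.01,
                                  num_filters=40, fft_length=1024,
                                  low_frequency=0, high_frequency=None)

# Take 80 consecutive frames (~0.8 s) to form one 80 x 40 feature map;
# stacking zeta such maps yields the zeta x 80 x 40 input cube.
feature_map = logenergy[:80, :]
print(feature_map.shape)  # (80, 40)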

Implementation of 3D Convolutional Operation

The following script has been used for our implementation:

self.conv11 = nn.Conv3d(1, 16, (4, 9, 9), stride=(1, 2, 1))
self.conv11_bn = nn.BatchNorm3d(16)
self.conv11_activation = torch.nn.PReLU()
self.conv12 = nn.Conv3d(16, 16, (4, 9, 9), stride=(1, 1, 1))
self.conv12_bn = nn.BatchNorm3d(16)
self.conv12_activation = torch.nn.PReLU()
self.conv21 = nn.Conv3d(16, 32, (3, 7, 7), stride=(1, 1, 1))
self.conv21_bn = nn.BatchNorm3d(32)
self.conv21_activation = torch.nn.PReLU()
self.conv22 = nn.Conv3d(32, 32, (3, 7, 7), stride=(1, 1, 1))
self.conv22_bn = nn.BatchNorm3d(32)
self.conv22_activation = torch.nn.PReLU()
self.conv31 = nn.Conv3d(32, 64, (3, 5, 5), stride=(1, 1, 1))
self.conv31_bn = nn.BatchNorm3d(64)
self.conv31_activation = torch.nn.PReLU()
self.conv32 = nn.Conv3d(64, 64, (3, 5, 5), stride=(1, 1, 1))
self.conv32_bn = nn.BatchNorm3d(64)
self.conv32_activation = torch.nn.PReLU()
self.conv41 = nn.Conv3d(64, 128, (3, 3, 3), stride=(1, 1, 1))
self.conv41_bn = nn.BatchNorm3d(128)
self.conv41_activation = torch.nn.PReLU()

As can be seen, torch.nn.Conv3d is used directly: the 3D kernel size is given as a (k_t, k_h, k_w) tuple and the stride as an (s_t, s_h, s_w) tuple, so no 2D-to-3D conversion is needed. Please refer to the official PyTorch documentation of nn.Conv3d for further details.
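
To show how these layers fit together, here is a stripped-down sketch that simply chains the blocks above (the pooling and fully connected layers of the actual network are omitted, and the input shape with ζ = 20 stacked utterances is an illustrative assumption):

import torch
import torch.nn as nn

# The convolutional blocks listed above, chained in a Sequential container
# with per-layer PReLU activations.
features = nn.Sequential(
    nn.Conv3d(1, 16, (4, 9, 9), stride=(1, 2, 1)), nn.BatchNorm3d(16), nn.PReLU(),
    nn.Conv3d(16, 16, (4, 9, 9), stride=(1, 1, 1)), nn.BatchNorm3d(16), nn.PReLU(),
    nn.Conv3d(16, 32, (3, 7, 7), stride=(1, 1, 1)), nn.BatchNorm3d(32), nn.PReLU(),
    nn.Conv3d(32, 32, (3, 7, 7), stride=(1, 1, 1)), nn.BatchNorm3d(32), nn.PReLU(),
    nn.Conv3d(32, 64, (3, 5, 5), stride=(1, 1, 1)), nn.BatchNorm3d(64), nn.PReLU(),
    nn.Conv3d(64, 64, (3, 5, 5), stride=(1, 1, 1)), nn.BatchNorm3d(64), nn.PReLU(),
    nn.Conv3d(64, 128, (3, 3, 3), stride=(1, 1, 1)), nn.BatchNorm3d(128), nn.PReLU(),
)

# Example input: a batch of 2 speakers, 1 channel, zeta = 20 stacked utterances,
# each an 80 x 40 MFEC feature map.
x = torch.randn(2, 1, 20, 80, 40)
print(features(x).shape)  # torch.Size([2, 128, 4, 6, 2])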

License

The license is as follows:

APPENDIX: How to apply the Apache License to your work.

   To apply the Apache License to your work, attach the following
   boilerplate notice, with the fields enclosed by brackets "{}"
   replaced with your own identifying information. (Don't include
   the brackets!) The text should be enclosed in the appropriate
   comment syntax for the file format. We also recommend that a
   file or class name and description of purpose be included on the
   same "printed page" as the copyright notice for easier
   identification within third-party archives.

Copyright {2017} {Amirsina Torfi}

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Please refer to the LICENSE file for further details.

Contribution

We look forward to your kind feedback. Please help us improve the code and make our work better. To contribute, please create a pull request and we will investigate it promptly. Once again, we appreciate your feedback and code inspections.

References

SpeechPy

Amirsina Torfi. 2017. astorfi/speech_feature_extraction: SpeechPy. Zenodo. doi:10.5281/zenodo.810392.

3d-convolutional-speaker-recognition-pytorch's Issues

Pre-trained weights?

Hi again, I've been working with this repo for a few days now. I tried training on the VoxCeleb dataset but I don't have a suitable GPU and it was going to take an insane length of time on the raw dataset with cuda=false... and then I noticed the files code/1-development/weights/net_final.pth and code/1-development/weights/net_epoch_170.pth.

I'm assuming the latter is a checkpoint file, but net_final.pth seemed to load OK into the enrollment phase. But my results when running my own data through enrollment and evaluation do not seem super-accurate. So my question is, is net_final.pth a reliably trained weights file? Or do I really have to train it using the raw VoxCeleb data?

I'm asking here because I couldn't see any info about it in the git log or in the README.

Requirements

@astorfi hi, firstly I would like to congratulate you on your work, it is pretty amazing.

Would you mind sharing the library versions you used in this repo?

Why compute power_spectrum in AudioDataset.__getitem__()?

I am sorry if I missed something obvious, but it seems like the power_spectrum variable isn't used in the AudioDataset.__getitem__() function in https://github.com/astorfi/3D-convolutional-speaker-recognition-pytorch/blob/master/code/2-enrollment/DataProviderEnrollment.py?
Also, the file is read twice.

def __getitem__(self, idx):
        # Get the sound file path
        sound_file_path = os.path.join(self.audio_dir, self.sound_files[idx].split()[1])

        ##############################
        ### Reading and processing ###
        ##############################

        # Reading .wav file
        fs, signal = wav.read(sound_file_path)

        # Reading .wav file
        import soundfile as sf
        signal, fs = sf.read(sound_file_path)

        # Label extraction
        label = int(self.sound_files[idx].split()[0])

        ###########################
        ### Feature Extraction ####
        ###########################

        # DEFAULTS:
        num_coefficient = 40

        # Stacking frames
        frames = speechpy.processing.stack_frames(signal, sampling_frequency=fs, frame_length=0.025,
                                                  frame_stride=0.01,
                                                  zero_padding=True)

        # # Extracting power spectrum (choosing 3 seconds and elimination of DC)
        power_spectrum = speechpy.processing.power_spectrum(frames, fft_points=2 * num_coefficient)[:, 1:]

        logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs, frame_length=0.025, frame_stride=0.01,
                                          num_filters=num_coefficient, fft_length=1024, low_frequency=0,
                                          high_frequency=None)

        ########################
        ### Handling sample ####
        ########################

        sample = {'feature': logenergy, 'label': label}

        ########################
        ### Post Processing ####
        ########################
        if self.transform:
            sample = self.transform(sample)

        return sample

Experimental results on large-size dataset

Good work!

It seems that the EER value (22.4%) for Google's TE2E model (Reference 14) reported in Table 3 is far from the counterpart (2.04%) in that paper. I do know that the experiments are carried out on different datasets.

However, have you done any further tests of your own model when it is scaled to a very large dataset?

How to get VoxCeleb data

Hi @astorfi , thanks for this great resource. I've been trying to get it running but I'm really confused about how to get the VoxCeleb dataset. The website asks me for login credentials every time I try to download any file, but there's no way of signing up or requesting a password, and Google is throwing a blank when I try to find out why.

Do you happen to know if the dataset should be freely available? Or is it blocking me because I'm not on an academic IP address or something?

Any help appreciated, thanks!

POI utterance duration

I recently requested the VoxCeleb1 data and got access to it. But I couldn't find the POI start and end time values in any of the .txt files provided in the dataset. Any idea whether the dataset format changed, or am I missing anything here to move my research forward?

A question about the parameter in enrollment

output_numpy = np.zeros(shape=[num_enrollment,40,128],dtype=np.float32)
model = np.zeros(shape=[40,128],dtype=np.float32)
outputs = net(inputs) # shape is (batch_size,128)
output_numpy[i] = outputs.cpu().data.numpy()
Here's the code in enrollment.py. I'm confused about the parameter 40. It seems that it has to be equal to the batch size, or it does not work. But the batch size in the code is 64.
