Speaker-Diarization

This project contains:

  • Text-independent Speaker recognition module based on VGG-Speaker-recognition
  • Speaker diarization based on UIS-RNN.
  • Most of the code is borrowed from UIS-RNN and VGG-Speaker-recognition. This project links the two by generating speaker embeddings, and also provides an intuitive display panel.

Prerequisites

  1. PyTorch
  2. Keras
  3. TensorFlow
  4. pyaudio (for installation on Windows, refer to pyaudio_portaudio)

Outline

1. Speaker recognition.

cd ghostvlad
python predict.py

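The pairwise distance matrix for the utterances of 4 speakers (3 utterances each) is shown below. Smaller values mean the two utterances are more similar; each diagonal block compares utterances of the same speaker.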

0.00  0.32  0.40  | 0.70  0.62  0.76  | 0.81  0.83  0.76  | 0.92  0.83  0.89  |

0.32  0.00  0.48  | 0.68  0.58  0.76  | 0.87  0.84  0.83  | 0.92  0.82  0.86  |

0.40  0.48  0.00  | 0.71  0.65  0.74  | 0.79  0.81  0.72  | 0.90  0.84  0.85  |

********************************************************************************

0.70  0.68  0.71  | 0.00  0.35  0.30  | 0.78  0.81  0.76  | 0.80  0.81  0.80  |

0.62  0.58  0.65  | 0.35  0.00  0.45  | 0.76  0.71  0.73  | 0.82  0.77  0.77  |

0.76  0.76  0.74  | 0.30  0.45  0.00  | 0.83  0.83  0.80  | 0.83  0.84  0.80  |

********************************************************************************

0.81  0.87  0.79  | 0.78  0.76  0.83  | 0.00  0.40  0.46  | 0.76  0.80  0.86  |

0.83  0.84  0.81  | 0.81  0.71  0.83  | 0.40  0.00  0.45  | 0.80  0.78  0.82  |

0.76  0.83  0.72  | 0.76  0.73  0.80  | 0.46  0.45  0.00  | 0.85  0.85  0.84  |

********************************************************************************

0.92  0.92  0.90  | 0.80  0.82  0.83  | 0.76  0.80  0.85  | 0.00  0.41  0.44  |

0.83  0.82  0.84  | 0.81  0.77  0.84  | 0.80  0.78  0.85  | 0.41  0.00  0.41  |

0.89  0.86  0.85  | 0.80  0.77  0.80  | 0.86  0.82  0.84  | 0.44  0.41  0.00  |

********************************************************************************
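A matrix like the one above can be computed from utterance embeddings with a few lines of NumPy. This is a minimal sketch that assumes cosine distance; the original does not say which metric `predict.py` uses, and the function name and toy vectors are hypothetical.

```python
import numpy as np


def pairwise_cosine_distance(embeddings):
    """Return the (N, N) matrix of cosine distances between the rows.

    Assumption: cosine distance (1 - cosine similarity). The metric used
    by the project itself is not stated in this README.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = embeddings / np.clip(norms, 1e-12, None)
    dist = 1.0 - unit @ unit.T
    # Remove tiny floating-point residue on the diagonal.
    np.fill_diagonal(dist, 0.0)
    return np.clip(dist, 0.0, 2.0)


if __name__ == "__main__":
    # Toy embeddings, not real model output.
    toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(np.round(pairwise_cosine_distance(toy), 2))
```

With real embeddings, rows belonging to the same speaker should form low-valued blocks along the diagonal, as in the matrix above.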

Thanks to the authors of VGG-Speaker-recognition, who kindly provide the code and pre-trained model. Their paper is UTTERANCE-LEVEL AGGREGATION FOR SPEAKER RECOGNITION IN THE WILD.
It is a novel idea that applies NetVLAD/GhostVLAD, which are popular in image recognition, to speaker recognition. The earlier state of the art was i-vector based and depended on GMM and PLDA.

I have also re-implemented the VGG speaker model in TensorFlow; see ghostvlad-speaker and the corresponding pretrained model.

This project only shows how to generate speaker embeddings with the pre-trained model, for use in uis-rnn training later.
For training the speaker model itself, see VGG-Speaker-Recognition.

Dataset

  1. http://www.openslr.org/38 contains 855 speakers with 120 Mandarin Chinese utterances each, 102600 utterances in total.
  2. VCTK contains 109 English speakers.
  3. VoxCeleb1 contains 1251 speakers.
  4. VoxCeleb2 contains 6112 speakers.

To generate speaker embeddings for the next training stage:

python generate_embeddings.py

You may need to change the dataset path to your own.

2. Speaker diarization.

Training

python train.py

The speaker embeddings generated by the VGG model are all non-negative vectors and contain many zero elements. uis-rnn seems to handle such data abnormally, producing NaN losses as shown below:

Iter: 0  	Training Loss: nan    
Negative Log Likelihood: 7.3020	Sigma2 Prior: nan	Regularization: 0.0007
Iter: 10  	Training Loss: nan    
Negative Log Likelihood: nan	Sigma2 Prior: nan	Regularization: nan
Iter: 20  	Training Loss: nan    
Negative Log Likelihood: nan	Sigma2 Prior: nan	Regularization: nan

When I added an insignificant bias (e.g. 0.00001) to each element of the vectors, the error disappeared:

Iter: 0  	Training Loss: -581.8732    
Negative Log Likelihood: 7.0125	Sigma2 Prior: -588.8864	Regularization: 0.0007
Iter: 10  	Training Loss: -614.1193    
Negative Log Likelihood: 1.7536	Sigma2 Prior: -615.8737	Regularization: 0.0007
Iter: 20  	Training Loss: -644.9244    
Negative Log Likelihood: 1.7123	Sigma2 Prior: -646.6375	Regularization: 0.0007
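The bias workaround described above can be sketched as follows. The value 1e-5 comes from the text; the function name `add_bias` and the toy array are hypothetical, and where exactly the project applies the bias is not shown in this README.

```python
import numpy as np

EPSILON = 1e-5  # the "insignificant bias" mentioned in the text


def add_bias(embeddings, eps=EPSILON):
    """Return a copy of the embeddings with eps added to every element.

    Hypothetical helper: it only illustrates the workaround, so that no
    element of the returned array is exactly zero when the input is
    non-negative.
    """
    return np.asarray(embeddings, dtype=np.float64) + eps


if __name__ == "__main__":
    # Toy non-negative embeddings with many zeros, not real model output.
    toy = np.array([[0.0, 0.5, 0.0], [0.0, 0.0, 1.2]])
    shifted = add_bias(toy)
    print("zeros before:", int((toy == 0).sum()))
    print("zeros after: ", int((shifted == 0).sum()))
```

The shift is far smaller than typical embedding values, so it should not noticeably change distances between embeddings.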

Clustering

python speakerDiarization.py

The result is shown below (3 speakers):

========= 0 =========
0:00.288 ==> 0:04.406
0:07.699 ==> 0:16.461
0:33.921 ==> 0:35.8
========= 1 =========
0:04.406 ==> 0:07.699
0:16.461 ==> 0:19.594
0:30.371 ==> 0:33.921
0:41.19 ==> 0:44.185
========= 2 =========
0:19.594 ==> 0:30.371
0:35.8 ==> 0:41.19
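A per-speaker listing like the one above can be produced by merging consecutive windows that received the same label. The sketch below is an illustration only, not the code in `speakerDiarization.py`: the function names are hypothetical, and it assumes each window carries a contiguous `(start_ms, end_ms)` span.

```python
from collections import defaultdict


def labels_to_segments(labels, window_times):
    """Merge consecutive windows with the same label into segments.

    labels       -- one predicted speaker id per window
    window_times -- one (start_ms, end_ms) pair per window (assumed layout)
    Returns a dict mapping speaker id to a list of (start_ms, end_ms).
    """
    if not labels:
        return {}
    segments = defaultdict(list)
    cur_label = labels[0]
    cur_start, cur_end = window_times[0]
    for label, (start, end) in zip(labels[1:], window_times[1:]):
        if label == cur_label:
            cur_end = end  # extend the running segment
        else:
            segments[cur_label].append((cur_start, cur_end))
            cur_label, cur_start, cur_end = label, start, end
    segments[cur_label].append((cur_start, cur_end))
    return dict(segments)


def fmt(ms):
    """Format milliseconds as m:ss.mmm."""
    minutes, rem = divmod(int(ms), 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{minutes}:{seconds:02d}.{millis:03d}"


if __name__ == "__main__":
    # Toy labels and one-second windows, not real model output.
    labels = [0, 0, 1, 1, 0]
    times = [(i * 1000, (i + 1) * 1000) for i in range(len(labels))]
    for speaker, spans in sorted(labels_to_segments(labels, times).items()):
        print(f"========= {speaker} =========")
        for start, end in spans:
            print(f"{fmt(start)} ==> {fmt(end)}")
```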

The final result is influenced by the window size and the overlap rate. When the overlap is too large, uis-rnn may produce fewer speakers because the speaker embeddings change smoothly between adjacent windows; when it is too small, it may produce more speakers. The window also cannot be too short: it must contain enough information to produce discriminative speaker embeddings.
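To make the relationship between window size and overlap rate concrete, here is a minimal sketch of sliding-window slicing. The function name, parameter names, and example numbers are hypothetical and are not taken from the project's code.

```python
def sliding_windows(total_len, win_len, overlap):
    """Return (start, end) index pairs of windows over a signal.

    total_len -- number of frames (or samples) in the signal
    win_len   -- frames per window
    overlap   -- fraction of a window shared with the next one, in [0, 1)
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # The hop is the part of a window that is NOT shared with the next one.
    hop = max(1, int(round(win_len * (1.0 - overlap))))
    windows = []
    start = 0
    while start + win_len <= total_len:
        windows.append((start, start + win_len))
        start += hop
    return windows


if __name__ == "__main__":
    for rate in (0.5, 0.75):
        wins = sliding_windows(total_len=1000, win_len=400, overlap=rate)
        print(f"overlap={rate}: {len(wins)} windows, first three {wins[:3]}")
```

A higher overlap rate yields more windows whose content differs less from one to the next, which is consistent with the smoother embedding sequence described above.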

Contributors

taylorlu, neozhangthe1
