
speaker-diarization's Introduction

Speaker Diarization scripts README

This README describes the scripts available for manual segmentation of media files (for annotation or other purposes), for speaker diarization, and for converting to and from the file formats of several related tools.

The scripts are written in either Python 2 or Perl, but interpreters for these should be readily available.

Please send any questions/suggestions to [email protected]

Quick Start Using Docker

A pre-built Docker container can be used to run the scripts.

docker pull blabbertabber/aalto-speech-diarizer

In the following example, we use the container to diarize a meeting.wav file:

docker run -it blabbertabber/aalto-speech-diarizer bash
cd /speaker-diarization
curl -k -OL https://nono.io/meeting.wav  # sample .wav; substitute yours
./spk-diarization2.py meeting.wav        # substitute your .wav filename
cat stdout                               # browse output

Installation instructions

Most of these scripts depend on the aku tools that are part of the AaltoASR package that you can find here. You should compile that for your platform first, following these instructions.

In this speaker-diarization directory:

  • Add a symlink to the folder AaltoASR/
  • Add a symlink to the folder AaltoASR/build
  • Add a symlink to AaltoASR/build/aku/feacat
  • Make sure the ffmpeg executable is on your PATH, or add a symlink to it too.

For example, if you have cloned and built AaltoASR in the ../AaltoASR path (relative to speaker-diarization), the following would work:

speaker-diarization$ ln -s ../AaltoASR ./
speaker-diarization$ ln -s ../AaltoASR/build ./
speaker-diarization$ ln -s ../AaltoASR/build/aku/feacat ./


You probably want to use spk-diarization2.py, since it calls the "2" versions of the other scripts, while spk-diarization.py uses an old, MATLAB-based VAD that is hard to configure and deprecated.

mseg.py

Script to help perform manual segmentation of a media file; it works with any media file type supported by mplayer. Its only dependency is a Python mplayer wrapper that can be installed locally by executing:

$ pip install --user mplayer.py

After that, executing it is just:

$ ./mseg.py /path/to/mediafile -o outputfile

The output file is optional. To continue a previously saved segmentation session, it also supports the invocation:

$ ./mseg.py /path/to/mediafile -o outputfile -i inputfile

Once in the program, the controls are:

  • Quit: esc or q
  • Pause: p
  • Mark position: space
  • Manually edit mark: e
  • Add manual mark: a
  • Remove mark: r
  • Faster speed: Up
  • Slower speed: Down
  • Rewind: Left
  • Fast Forward: Right
  • Scroll down marks: pgDwn
  • Scroll up marks: pgUp

The media file starts paused, so to start playback just hit the p key.

mseg2elan.py

Script to convert from mseg output to Elan file format.

Usage:

$ ./mseg2elan.py msoutputfile -o outputfile

If outputfile is not specified, the output will be sent to stdout. Once in Elan, segments can be easily fine-tuned by switching to segmentation mode, in Options->Segmentation Mode.

aku2elan.py

Script to convert from AKU recipes to Elan file format.

Usage:

$ ./aku2elan.py recipe -o outputfile

If outputfile is not specified, the output will be sent to stdout. Once in Elan, segments can be easily fine-tuned by switching to segmentation mode, in Options->Segmentation Mode.

elan2aku.py

Script to convert from Elan file format to AKU recipes.

Usage:

$ ./elan2aku.py elanoutputfile -o akurecipe

If akurecipe is not specified, the output will be sent to stdout.
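
For illustration only, the same direction of conversion can be sketched with the third-party pympi library (these scripts do not use pympi, and the recipe field layout below follows the recipe lines shown in the issues further down; the function is hypothetical):

import pympi  # pip install pympi-ling

def elan_to_recipe(eaf_path, audio):
    # Rough, hypothetical equivalent of elan2aku.py's conversion:
    # read every annotation of every tier and emit AKU-style lines.
    eaf = pympi.Elan.Eaf(eaf_path)
    lines = []
    for tier in eaf.get_tier_names():
        for ann in eaf.get_annotation_data_for_tier(tier):
            start_ms, end_ms, value = ann[0], ann[1], ann[2]
            lines.append('audio=%s start-time=%.3f end-time=%.3f speaker=%s' %
                         (audio, start_ms / 1000.0, end_ms / 1000.0, value or tier))
    return lines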

mseg_to_textgrid.pl

Script to convert from mseg output to praat file format.

Usage:

$ perl mseg_to_textgrid.pl msfile > outputfile

If outputfile is not specified, the output will be sent to stdout.

voice-detection2.py

Creates an AKU recipe from the generate_exp.py output (.exp files).

For full help, use:

$ ./voice-detection2.py -h

vad-performance.py

Rates the performance of a Voice Activity Detection recipe in AKU format, such as those created with voice-detection.py. To measure the performance, another recipe with ground truth should be provided.

For full help, use:

$ ./vad-performance.py -h

spk-change-detection.py

Performs speaker turn segmentation over audio, using a distance measure such as GLR, KL2, or BIC, and a sliding or growing window. It requires an input recipe file in AKU format pointing to the audio files, preferably with speech/non-speech turns already processed, and a features file for each wav to process, in the format produced by the feacat program of the AKU suite.

For full help, use:

$ ./spk-change-detection.py -h
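
To illustrate the window-comparison idea, here is a minimal sketch of a ΔBIC distance between two feature windows. This is an independent illustration assuming the standard Chen & Gopalakrishnan formulation, not code taken from this script:

import numpy as np
from numpy.linalg import det

def bic_distance(w1, w2, lam=1.0):
    # w1, w2: feature windows, one frame per row.
    # Positive values suggest a speaker change between the windows.
    both = np.concatenate((w1, w2))
    n, n1, n2 = len(both), len(w1), len(w2)
    d = both.shape[1]
    # Maximized Gaussian log-likelihoods reduce to covariance determinants
    dist = (0.5 * n * np.log(det(np.cov(both, rowvar=False)))
            - 0.5 * n1 * np.log(det(np.cov(w1, rowvar=False)))
            - 0.5 * n2 * np.log(det(np.cov(w2, rowvar=False))))
    # Penalty for the extra parameters of the two-model hypothesis
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return dist - penalty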

spk-change-performance.py

Rates the performance of a speaker turn segmentation recipe in AKU format, such as those created with spk-change-detection.py. To measure the performance, another recipe with ground truth should be provided.

For full help, use:

$ ./spk-change-performance.py -h

spk-clustering.py

Performs speaker turn clustering over audio. It requires a speaker segmentation recipe in AKU format, such as those created with spk-change-detection.py, and a features file for each wav file to process, in the format produced by the feacat program of the AKU suite.

For full help, use:

$ ./spk-clustering.py -h
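
As a rough illustration of hierarchical (agglomerative) clustering over segments, reusing the bic_distance sketch above (again an illustration under those assumptions, not this script's actual logic):

def cluster_segments(segments, threshold, dist=bic_distance):
    # Greedily merge the closest pair of segment clusters until the
    # smallest pairwise distance exceeds the threshold.
    clusters = list(segments)
    while len(clusters) > 1:
        d, i, j = min((dist(a, b), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        if d > threshold:
            break
        merged = np.concatenate((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters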

spk-time.py

Calculates per-speaker speaking time from a speaker-tagged recipe in AKU format.

For full help, use:

$ ./spk-time.py -h
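
The underlying computation is straightforward; a minimal sketch, assuming speaker-tagged recipe lines like the ones shown in the issues below:

from collections import defaultdict

def speaking_time(recipe_lines):
    # Sum per-speaker durations from lines such as:
    # audio=meeting.wav lna=a_1 start-time=0.384 end-time=5.82 speaker=speaker_1
    totals = defaultdict(float)
    for line in recipe_lines:
        line = line.strip()
        if not line:
            continue
        fields = dict(kv.split('=', 1) for kv in line.split())
        totals[fields['speaker']] += (float(fields['end-time']) -
                                      float(fields['start-time']))
    return dict(totals)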

spk-diarization2.py

Performs full speaker diarization over a media file. If the media is not a wav file, it tries to convert it to wav using ffmpeg (a sample manual conversion is shown after the notes below). It then calls generate_exp.py, voice-detection2.py, spk-change-detection.py, and spk-clustering.py in succession.

For full help, use:

$ ./spk-diarization2.py -h

Notes:

  • Paths for the other scripts and features must be provided.
  • Since this script is a convenient wrapper for the other scripts of the family, it doesn't have options for all the settings of the other scripts, just some defaults. If you want to tune them, edit this script directly.
  • Some scripts have a "2" version; usage of that one is preferable.
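
For reference, a manual conversion to the format of the sample meeting.wav quoted in the issues below (16 kHz, mono, 16-bit PCM) would look like this; the exact flags the script passes to ffmpeg may differ:

ffmpeg -i input.mp3 -ar 16000 -ac 1 -acodec pcm_s16le output.wav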

Contributors

Brendan Cunnie (@saintbrendan, [email protected]) and Brian Cunnie (@cunnie, [email protected]) contributed the Dockerfile. Tran Tu (@tran2, [email protected]) added ffmpeg to it to support non-wav files.

speaker-diarization's People

Contributors

antoniomo, cunnie, saintbrendan, tran2


speaker-diarization's Issues

Questions about BIC distance calculation

Hi,

I am confused about the BIC distance calculation here:
d = 0.5 * N * np.log(det(S)) - 0.5 * N1 * np.log(det(S1)) - 0.5 * N2 * np.log(det(S2))
As in speaker_clustering.py, lines 95 and 96.

As far as I know, the BIC calculation is based on the log-likelihood of each sample under its model. What's the relation between the determinants (det(S), det(S1), det(S2)) and the sample probabilities? Here is the equation I know for the BIC calculation:
[image: the standard BIC formula]

I think there must be some theoretical background for this calculation, but I couldn't work it out.
Could anyone help me with this question? Thanks.
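
For context, here is a sketch of the standard Chen & Gopalakrishnan derivation, which this code appears to follow. For a d-dimensional Gaussian fitted by maximum likelihood to N frames, the maximized log-likelihood is

\log\hat{L} = -\frac{N}{2}\log|\hat{\Sigma}| + \text{const},

so comparing the one-segment model against the two-segment split reduces the likelihood terms of BIC to covariance determinants:

\Delta BIC = \frac{N}{2}\log|S| - \frac{N_1}{2}\log|S_1| - \frac{N_2}{2}\log|S_2| - \lambda P,
\qquad P = \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N

which matches the quoted line up to the penalty term P.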

OSError: [Errno 2] No such file or directory

./mseg.py "/home/User/Desktop/Audio Files/51-Sleep.mp3"

Traceback (most recent call last):
File "./mseg.py", line 280, in
main(args.infile, marks, None)
File "./mseg.py", line 90, in main
p = mp.Player()
File "/home/User/.local/lib/python2.7/site-packages/mplayer/core.py", line 110, in init
self.spawn()
File "/home/User/.local/lib/python2.7/site-packages/mplayer/core.py", line 324, in spawn
close_fds=(sys.platform != 'win32'))
File "/usr/lib/python2.7/subprocess.py", line 710, in init
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory

I have run into this error but have been unable to find a solution.

Please suggest how exactly I can start running the application to recognize speakers in a source audio file.

Excessive CPU

I'm not sure if this is a bug, but I tried running the Docker diarization example on a 1-hour audio file containing 4 speakers. The spk-diarization2.py script ran for 3 hours consuming 100% of my CPU and still hadn't completed, so I had to kill the container and abort the evaluation. Is that normal, or should it have completed in that time?

I've tested other diarization tools on the same file, and they normally complete in about 1/4 of the file's length, or about 15 minutes for a 1-hour file.

Can I recognize only one speaker?

spk-diarization2.py can take a long audio file and identify each speaker's speaking time.
Can it be made to identify only one specific speaker, to improve the accuracy rate?
I only need to identify a single speaker.

Is this configured through the feacat acoustic model config?

Looking forward to your reply
Thank you!

Building my own model

Dear developers,

I want to build my own HMM model based on my own training data. Is the training code included here? How can I do that?

Best Regards

UnboundLocalError: local variable 'feas' referenced before assignment

While trying to execute the command below:

python spk-diarization2.py /mnt/c/users/karthikeyan/Downloads/proper.wav

I get:

Reading file: /mnt/c/users/karthikeyan/Downloads/proper.wav
Writing output to: stdout
Using feacat from: /home/userk/speaker-diarization/feacat
Writing temporal files in: /tmp
Writing lna files in: /home/userk/speaker-diarization/lna
Writing exp files in: /home/userk/speaker-diarization/exp
Writing features in: /home/userk/speaker-diarization/fea
Performing exp generation and feacat concurrently
Traceback (most recent call last):
File "./generate_exp.py", line 37, in
from docopt import docopt
ImportError: No module named docopt
Calling voice-detection2.py
Reading recipe from: /tmp/initrypiaG.recipe
Reading .exp files from: /home/userk/speaker-diarization/exp
Writing output to: /tmp/vadHJVgzE.recipe
Sample rate set to: 125
Minimum speech turn duration: 0.5 seconds
Minimum nonspeech between-turns duration: 1.5 seconds
Segment before expansion set to: 0.0 seconds
Segment end expansion set to: 0.0 seconds
Error, /home/userk/speaker-diarization/exp/proper.exp does not exist
Waiting for feacat to end.
Calling spk-change-detection.py
Reading recipe from: /tmp/vadHJVgzE.recipe
Reading feature files from: /home/userk/speaker-diarization/fea
Feature files extension: .fea
Writing output to: /tmp/spkcM3EdlF.recipe
Conversion rate set to frame rate: 125.0
Using a growing window
Deltaws set to: 0.096 seconds
Using BIC as distance measure, lambda = 1.0
Window size set to: 1.0 seconds
Window step set to: 3.0 seconds
Threshold distance: 0.0
Useful metrics for determining the right threshold:

Maximum between windows distance: 0
Total windows: 0
Total segments: 0
Maximum between detected segments distance: 0
Total detected speaker changes: 0
Calling spk-clustering.py
('===', '/tmp/spkcM3EdlF.recipe')
Reading recipe from: /tmp/spkcM3EdlF.recipe
Reading feature files from: /home/userk/speaker-diarization/fea
Feature files extension: .fea
Writing output to: stdout
Conversion rate set to frame rate: 125.0
Using hierarchical clustering
Using BIC as distance measure, lambda = 1.3
Threshold distance: 0.0
Maximum speakers: 0
('::::::::::::::::::::::::::::::::::', 0)
Initial cluster with: 0 speakers
Traceback (most recent call last):
File "./spk-clustering.py", line 432, in
process_recipe(parsed_recipe, speakers, outf)
File "./spk-clustering.py", line 293, in process_recipe
spk_cluster_m(feas[1], recipe, speakers, outf, dist, segf)
UnboundLocalError: local variable 'feas' referenced before assignment

I tried looking into spk-clustering.py; the len(recipe) and feas values are 0.
Thank you!

Docker Image for Speech Diarization

@antoniomo:

@saintbrendan and I built a Docker container to host your speech diarizer application to make it easy for people to get up-and-running quickly with the Aalto speaker-diarization software.

We'd like you to consider incorporating the Dockerfile into the repository, possibly even creating an aalto-speech Docker Hub organization with an automated build that builds the Docker container. In other words, we'd like to give you the Dockerfile and have you maintain it/be responsible for it. Don't misunderstand: we'd be happy to maintain the Dockerfile ourselves; it's just that we don't want to encroach on something that really belongs to you.

Here are some options:

  • We could submit a pull-request to place the Dockerfile in the speaker-diarization repo (in docker/Dockerfile) and modify the README.md to include instructions to use the Dockerfile.

  • We could additionally walk you through the creation of a Docker organization with an automated container build. We could remote-pair to turbo-charge the process (it should take ~20 minutes).

  • Do nothing. My brother and I would continue to happily maintain the Docker container.

Here's how easy it is to diarize a meeting in 4 easy steps with the Docker container (it's much easier than trying to install Python/pip/NumPy/SciPy, etc.). And it's especially useful for Windows users:

docker run -it blabbertabber/aalto-speech-diarizer bash
cd /speaker-diarization
curl -k -OL https://nono.io/meeting.wav  # sample .wav; substitute yours
./spk-diarization2.py meeting.wav        # substitute your .wav filename
cat stdout                               # browse output

Dockerfile is missing the step to install ffmpeg

I had to run these extra commands to install ffmpeg (needed for converting to wav):

dnf install https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
yum install ffmpeg

DockerHub-hosted image not up to date

PR #11 added the missing ffmpeg dependency to the Dockerfile, but no new image was deployed to the Docker Hub repo mentioned in the README (blabbertabber/aalto-speech-diarizer), so the latest image in that repo does not incorporate the fix (it's 11 months old).

I'm not sure if the owner of that repo is associated with this project, but it would be great to update the image, or at least update the README with a warning (I'd be happy to do that).

Thanks!

separate `.wav` files per speaker

I am looking to generate separate .wav files for each speaker after diarization.

Is this an existing feature, or does it need to be built?

I would like to build this feature, which would consume the output from spk-diarization2.py and generate a .wav file for each speaker.

Issue with Running Audio File

Hi,
I was wondering if you could help me. I am trying to run an mp3 file, but I get this error.
Traceback (most recent call last):
File "./spk-diarization.py", line 64, in
print '%s does not exist, exiting' % args.feacat
AttributeError: 'Namespace' object has no attribute 'feacat'

I want to make a wav file for each speaker from the stdout output; how can I do that?

Hello, I'm gohn, and your project is great!
I used Docker, ran the .sh file, and produced this stdout output:

audio=meeting.wav lna=a_1 start-time=0.384 end-time=5.82 speaker=speaker_1
audio=meeting.wav lna=a_2 start-time=5.82 end-time=31.648 speaker=speaker_2
audio=meeting.wav lna=a_3 start-time=31.648 end-time=58.272 speaker=speaker_1
audio=meeting.wav lna=a_4 start-time=60.032 end-time=66.536 speaker=speaker_1
audio=meeting.wav lna=a_5 start-time=66.536 end-time=68.748 speaker=speaker_2
audio=meeting.wav lna=a_6 start-time=68.748 end-time=70.576 speaker=speaker_2
audio=meeting.wav lna=a_7 start-time=70.576 end-time=78.264 speaker=speaker_2
audio=meeting.wav lna=a_8 start-time=79.84 end-time=80.248 speaker=speaker_2
audio=meeting.wav lna=a_9 start-time=80.248 end-time=82.792 speaker=speaker_2
audio=meeting.wav lna=a_10 start-time=82.792 end-time=83.372 speaker=speaker_2
audio=meeting.wav lna=a_11 start-time=83.372 end-time=88.96 speaker=speaker_2
audio=meeting.wav lna=a_12 start-time=88.96 end-time=93.288 speaker=speaker_1
audio=meeting.wav lna=a_13 start-time=93.288 end-time=93.9 speaker=speaker_2
audio=meeting.wav lna=a_14 start-time=93.9 end-time=96.436 speaker=speaker_1
audio=meeting.wav lna=a_15 start-time=96.436 end-time=98.436 speaker=speaker_2
audio=meeting.wav lna=a_16 start-time=98.436 end-time=102.736 speaker=speaker_2
audio=meeting.wav lna=a_17 start-time=102.736 end-time=103.284 speaker=speaker_2
audio=meeting.wav lna=a_18 start-time=103.284 end-time=103.888 speaker=speaker_2

But I want to produce a separate file with each speaker's speech:
speaker_1/~~~.wav
speaker_2/~~~.wav

Does your code have a method for making each wav file, or is there another way to do it?
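
For what it's worth, here is a minimal sketch of one way to do this outside the toolkit (not an existing feature; the script name and output layout are hypothetical). It parses recipe lines like the ones above and cuts segments with ffmpeg:

#!/usr/bin/env python
# split_by_speaker.py (hypothetical): cut a wav into per-speaker segment
# files from spk-diarization2.py recipe output. Assumes ffmpeg is on PATH.
import os
import subprocess
import sys

def split_by_speaker(recipe_path):
    with open(recipe_path) as f:
        for i, line in enumerate(f):
            if not line.strip():
                continue
            fields = dict(kv.split('=', 1) for kv in line.split())
            speaker = fields['speaker']
            if not os.path.isdir(speaker):
                os.makedirs(speaker)
            out = os.path.join(speaker, '%s_%03d.wav' % (speaker, i))
            # -ss/-to select the segment; -c copy keeps the PCM data as-is
            subprocess.check_call(['ffmpeg', '-y', '-i', fields['audio'],
                                   '-ss', fields['start-time'],
                                   '-to', fields['end-time'],
                                   '-c', 'copy', out])

if __name__ == '__main__':
    split_by_speaker(sys.argv[1])

Running it as ./split_by_speaker.py stdout on the output above would produce speaker_1/ and speaker_2/ directories of segment wavs.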

WAV files format

When using your example wav, spk-diarization2.py works pretty well. However, when I try it on my own WAV files I get strange results. I compared the WAV properties between my files and yours:
Input #0, wav, from 'meeting.wav':
Duration: 00:09:25.60, bitrate: 256 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s
I manipulated my wav file to have the same properties using ffmpeg, but it still doesn't work.

I can't understand what more I can do. Is there any manipulation I need to apply to the WAV files before using the script?

Tunable Parameters

Hello! Awesome repository, thanks for making it available!

I have two questions about parameters used to perform speaker diarization:

The algorithm usually detects one speaker when two female speakers are talking in a telephone conversation. Are there any tunable parameters we could change to get better performance on a specific dataset?

Also, my audio's sample rate is 8 kHz. I tried changing the sample_rate parameter in fconfig.cfg to 8000, but the code still throws an exception:

exception: Audio file sample rate (8000 Hz) and model configuration (16000 Hz) don't agree.

As an (ugly but really fast) workaround, I changed "16000" to "8000" in every file where I found it and recompiled everything; the code then runs normally, but I'm not certain the speaker diarization is really working. In which config file should I change the sample rate?

Thanks!

Speed is too slow. Can this Docker image use a GPU?

Hello, it's me again.
I'm using your library now, via Docker.

But it is too slow.
I gave it a wav file of about 250 MB, but it uses a lot of CPU and has been running for more than 2 hours. (Actually, it is still running, so I don't know when it will finish.)

Can this library use a GPU?
How can I make it run faster than it does now?
