
voxsrc2023's Introduction

VoxSRC-23

In this repository we provide the validation toolkit for the VoxCeleb Speaker Recognition Challenge 2023. The challenge consists of two main tasks, speaker verification and speaker diarisation, and this toolkit is largely adapted from last year's development kit.

This repository contains submodules. Please use the following command to clone the repository:

git clone https://github.com/JaesungHuh/VoxSRC2023.git --recursive

Dependencies

pip install -r requirements.txt

Speaker Verification

Within speaker verification, we have three tracks (closed, open, and semi-supervised). We provide two sets of validation and test data: one for Tracks 1 & 2 and one for Track 3. The validation and test data for the challenge both consist of pairs of audio segments, and the task is to determine whether they are from the same speaker or from different speakers. Teams are invited to create a system that takes the test data and produces a list of floating-point scores, with a single score for each pair of segments.

Validation Data

This repository only provides the validation trial pairs for Tracks 1 and 2. For Track 3, please download them using this link. The validation data consist of trial pairs of speech from the identities in the VoxCeleb1 dataset, together with additional data we have prepared. To download the audio files, please visit our website. Each trial pair consists of two single-speaker audio segments and can be found in data/verif/VoxSRC2023_val.txt.

File Format

Your output file for scoring should be a single space-delimited text file containing one pair of audio segments per line, each line containing three fields:

  • Score -- score; must be a floating-point number. A higher score means the pair of segments is more likely to come from the same speaker.
  • FILE1 -- name of the first audio segment; should correspond to the trials provided in data/verif/VoxSRC2023_val.txt
  • FILE2 -- name of the second audio segment; should correspond to the trials provided in data/verif/VoxSRC2023_val.txt

For example:

0.007 034374.wav 053313.wav
0.030 019394.wav 002252.wav
-0.118 063113.wav 005437.wav

Also see data/verif/VoxSRC2023_val_score.txt for an example file.
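
For reference, below is a minimal sketch of producing such a score file. It assumes the validation trial list contains a label followed by the two file names on each line, and score_pair is a hypothetical placeholder for your own scoring back-end.

# Sketch: score every trial pair and write the space-delimited output file.
# score_pair is a hypothetical placeholder for your own verification system.

def score_pair(file1, file2):
    # Placeholder: replace with your own scoring, e.g. cosine similarity
    # between the speaker embeddings of the two segments.
    return 0.0

with open("data/verif/VoxSRC2023_val.txt") as f_in, open("my_scores.txt", "w") as f_out:
    for line in f_in:
        # validation trials are "label file1 file2"; keep only the two file names
        file1, file2 = line.split()[-2:]
        f_out.write(f"{score_pair(file1, file2):.3f} {file1} {file2}\n")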

The leaderboard for the challenge will compute two metrics, the Detection Cost and the Equal Error Rate (EER). Further details about the metrics are provided below. The file compute_min_dcf.py computes the detection cost. With the random scores provided you should get a Detection Cost of 0.2142.

python compute_min_dcf.py --p-target 0.05 --c-miss 1 --c-fa 1 data/verif/VoxSRC2023_val_score.txt data/verif/VoxSRC2023_val.txt

The file compute_EER.py computes the EER. With the example file provided, you should get an EER of 4.095%.

python compute_EER.py --ground_truth data/verif/VoxSRC2023_val.txt --prediction data/verif/VoxSRC2023_val_score.txt

While the leaderboard will display both metrics, the winners of the challenge will be determined using the Detection Cost ALONE, which is the primary metric.

Metrics

Equal Error Rate

The equal error rate (EER) is the common value of the false acceptance rate (FAR) and false rejection rate (FRR) at the operating threshold where the two are equal, with FAR = FP / (FP + TN) and FRR = FN / (TP + FN),

where

  • FN is the number of false negatives
  • FP is the number of false positives
  • TN is the number of true negatives
  • TP is the number of true positives
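
For reference, a minimal sketch of an EER computation is shown below; it assumes labels of 1 for same-speaker pairs and 0 for different-speaker pairs, and the repository's compute_EER.py may differ in detail.

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    # labels: 1 = same speaker (target), 0 = different speaker (non-target)
    far, tpr, _ = roc_curve(labels, scores)
    frr = 1.0 - tpr
    # the EER is the point where FAR and FRR cross
    idx = np.nanargmin(np.abs(frr - far))
    return (far[idx] + frr[idx]) / 2.0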

Minimum Detection Cost

Compared to the equal error rate, which assigns equal weight to false negatives and false positives, this metric is typically used to assess performance in settings where achieving a low false-positive rate is more important than achieving a low false-negative rate. We follow the procedure outlined in Sec. 3.1 of the NIST 2018 Speaker Recognition Evaluation Plan for the AfV trials, and use the following parameters for the cost function:

  1. CMiss (cost of a missed detection) = 1
  2. CFalseAlarm (cost of a spurious detection) = 1
  3. PTarget (a priori probability of the specified target speaker) = 0.05

This is the PRIMARY metric for the challenge.
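
A minimal sketch of this computation is shown below, assuming labels of 1 for target (same-speaker) and 0 for non-target trials; use the official compute_min_dcf.py (adapted from Kaldi) for any reported numbers.

import numpy as np

def min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    # Sort trials by score; sweeping the threshold over the sorted scores,
    # everything at or below the threshold is rejected.
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    labels = labels[np.argsort(scores)]
    n_tar, n_non = labels.sum(), len(labels) - labels.sum()
    p_miss = np.concatenate(([0.0], np.cumsum(labels) / n_tar))
    p_fa = np.concatenate(([1.0], 1.0 - np.cumsum(1 - labels) / n_non))
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    # Normalise by the cost of the best trivial system (accept all or reject all).
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))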

Baseline code

We provide the baseline model and code in baseline_verif.py. The code uses an ECAPA-TDNN model trained on VoxCeleb1 and 2. WARNING: this model cannot be used for Track 1, since it is also trained on VoxCeleb1.
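
For illustration only, the snippet below scores a single trial pair with a publicly available pretrained ECAPA-TDNN from SpeechBrain; it is a stand-in sketch, not the repository's baseline_verif.py, and since that model is also trained on VoxCeleb1 and 2 the same Track 1 caveat applies.

# Illustration only: score one trial pair with a pretrained ECAPA-TDNN from SpeechBrain.
from speechbrain.pretrained import SpeakerRecognition

verification = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
score, same_speaker = verification.verify_files("034374.wav", "053313.wav")
print(float(score))  # cosine similarity; higher means more likely the same speaker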

Speaker Diarisation

For speaker diarisation, we only have a single track. The goal is to break up multispeaker audio into sections of "who spoke when". In our case, each multispeaker audio file is independent (i.e. we will treat the sets of speakers in each file as disjoint), and the audio files will be of variable length. Our scoring code is obtained from the excellent DSCORE repo.

Validation Data

The validation data for this year is the VoxConverse dataset (v0.3). Please visit this link to download the wav files and RTTM files.

File Format

Your output file for scoring (as well as the ground truth labels for the validation set which we provide) must be a Rich Transcription Time Marked (RTTM) file.

Rich Transcription Time Marked (RTTM) files are space-delimited text files containing one turn per line, each line containing ten fields:

  • Type -- segment type; should always be SPEAKER
  • File ID -- file name; basename of the recording minus extension (e.g., abcde)
  • Channel ID -- channel (1-indexed) that turn is on; should always be 1
  • Turn Onset -- onset of turn in seconds from beginning of recording
  • Turn Duration -- duration of turn in seconds
  • Orthography Field -- should always be <NA>
  • Speaker Type -- should always be <NA>
  • Speaker Name -- name of speaker of turn; should be unique within scope of each file
  • Confidence Score -- system confidence (probability) that information is correct; should always be <NA>
  • Signal Lookahead Time -- should always be <NA>

For instance:

SPEAKER abcde 1   0.240   0.300 <NA> <NA> 3 <NA> <NA>
SPEAKER abcde 1   0.600   1.320 <NA> <NA> 3 <NA> <NA>
SPEAKER abcde 1   1.950   0.630 <NA> <NA> 3 <NA> <NA>
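
For convenience, a small hypothetical helper that writes turns in this format is sketched below; write_rttm and its (onset, duration, speaker) tuples are illustrative and not part of this toolkit.

def write_rttm(path, file_id, turns):
    # turns: list of (onset_seconds, duration_seconds, speaker_name) tuples
    with open(path, "w") as f:
        for onset, duration, speaker in turns:
            f.write(f"SPEAKER {file_id} 1 {onset:.3f} {duration:.3f} "
                    f"<NA> <NA> {speaker} <NA> <NA>\n")

# Reproduces the three example lines above (speaker name "3").
write_rttm("abcde.rttm", "abcde",
           [(0.240, 0.300, "3"), (0.600, 1.320, "3"), (1.950, 0.630, "3")])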

If you would like to confirm that your output RTTM file is valid, use the included validate_rttm.py script. We provide an example in data/diar/baseline_dev.rttm for the VoxConverse dev set.

python validate_rttm.py data/diar/baseline_dev.rttm

The file compute_diarisation_metrics.py computes both DER and JER. To calculate the metrics with our baseline RTTM file, run:

python compute_diarisation_metrics.py -r voxconverse/dev/*.rttm -s data/diar/baseline_dev.rttm

Metrics

Diarization Error Rate (DER)

The leaderboard will be ranked using the diarization error rate (DER), which is the sum of

  • speaker error -- percentage of scored time for which the wrong speaker id is assigned within a speech region
  • false alarm speech -- percentage of scored time for which a nonspeech region is incorrectly marked as containing speech
  • missed speech -- percentage of scored time for which a speech region is incorrectly marked as not containing speech

We use a collar of 0.25 seconds and include overlapping speech in the scoring. For more details, consult section 6.1 of the NIST RT-09 evaluation plan.
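
As a toy illustration with hypothetical numbers, DER is simply the total duration of these three error types divided by the total scored speech time:

# Hypothetical numbers: 100 s of scored reference speech, of which 3 s is missed,
# 2 s is false-alarm speech, and 5 s is assigned to the wrong speaker.
missed, false_alarm, speaker_error, scored_speech = 3.0, 2.0, 5.0, 100.0
der = (missed + false_alarm + speaker_error) / scored_speech
print(f"DER = {der:.1%}")  # DER = 10.0%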

Jaccard error rate (JER)

We also report Jaccard error rate (JER), a metric introduced for DIHARD II that is based on the Jaccard index. The Jaccard index is a similarity measure typically used to evaluate the output of image segmentation systems and is defined as the ratio between the intersection and union of two segmentations.
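
As a toy sketch of the underlying idea only (not the full JER, which additionally finds an optimal mapping between reference and system speakers and averages over reference speakers, as implemented in DSCORE), the Jaccard index of two segmentations can be approximated on a frame grid:

import numpy as np

def jaccard_index(ref_segs, sys_segs, frame=0.01):
    # ref_segs, sys_segs: lists of (onset, offset) pairs in seconds for one speaker
    duration = max(off for _, off in ref_segs + sys_segs)
    n = int(np.ceil(duration / frame))
    ref, hyp = np.zeros(n, bool), np.zeros(n, bool)
    for on, off in ref_segs:
        ref[int(on / frame):int(round(off / frame))] = True
    for on, off in sys_segs:
        hyp[int(on / frame):int(round(off / frame))] = True
    union = np.logical_or(ref, hyp).sum()
    return np.logical_and(ref, hyp).sum() / union if union else 1.0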

Baseline code

There are several baseline models that you could run. Please take a look.

Further Details

Please refer to the Challenge Webpage for more information about the challenge.

Acknowledgements

The code for computing the detection cost is largely obtained from the excellent KALDI toolkit, and the code for computing the DER is from the excellent DSCORE repo. Please read their licenses carefully before redistributing.

voxsrc2023's Issues

Suspected label errors in verif pairs

Firstly, many thanks for providing these valuable datasets!

I'd like to report a potential issue with some pairs in the verif/VoxSRC2023_val.txt file.
It seems that a few pairs may have incorrect labels.

Here are the pairs in the verification validation set that appear to be incorrectly labelled:

1 011225.wav 060419.wav
1 060000.wav 048652.wav
1 010915.wav 049680.wav
1 003684.wav 018203.wav
1 017842.wav 061341.wav
1 020685.wav 039926.wav
1 014432.wav 048013.wav
1 007463.wav 018203.wav
1 043317.wav 020439.wav
1 029826.wav 044283.wav

While there may be additional cases, I have personally identified these ten through listening.
Could you review them and confirm whether they are correct?

Thank you 😃
