
pyannote-db-plumcot's People

Contributors

aman-berhe, amanatgit, hbredin, lgalmant, paullerner, sharleynelefevre


pyannote-db-plumcot's Issues

API for NLP stuff

Before releasing the corpus, we should provide an example of how to load transcripts (and their forced-aligned versions when available).

This probably means that we need to implement a dedicated API for that in the pyannote.database plugin.

>>> from pyannote.database import get_protocol
>>> protocol = get_protocol('TheBigBangTheory.????.Transcription')
>>> for file in protocol.train():
...    transcription = file['transcription']
...    for line in transcription:
...       line['speaker']
...       line['text'] 
...       # other fields of interest? time?
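
A rough sketch of how such a transcription field could be exposed through a pyannote.database preprocessor; the transcript location and the one-"speaker text"-line-per-turn layout below are assumptions for illustration, not the actual Plumcot format:

from pathlib import Path

from pyannote.database import get_protocol

def load_transcription(current_file):
    # hypothetical transcript location: one "<speaker> <text...>" line per speech turn
    path = (Path("Plumcot/data") / current_file["database"]
            / "transcripts" / f"{current_file['uri']}.txt")
    lines = []
    with open(path) as transcript:
        for raw in transcript:
            speaker, _, text = raw.rstrip("\n").partition(" ")
            lines.append({"speaker": speaker, "text": text})
    return lines

# every file yielded by the protocol then carries a 'transcription' key
protocol = get_protocol("TheBigBangTheory.SpeakerDiarization.0",
                        preprocessors={"transcription": load_transcription})
for file in protocol.train():
    for line in file["transcription"]:
        print(line["speaker"], line["text"])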

deepspeech-gpu installation error

Looks like deepspeech-gpu 0.7.3 can no longer be installed (at least on my macOS).
Where is this dependency used? Can we make it optional?

pip install ./pyannote-db-plumcot
ERROR: Could not find a version that satisfies the requirement deepspeech-gpu==0.7.3 (from pyannote.db.plumcot==0+untagged.455.g0b364bc) (from versions: none)
ERROR: No matching distribution found for deepspeech-gpu==0.7.3 (from pyannote.db.plumcot==0+untagged.455.g0b364bc)
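
One way to make it optional would be to move deepspeech-gpu from install_requires to an extra in setup.py; the extra name "asr" below is made up for illustration:

# setup.py (sketch): deepspeech-gpu becomes opt-in instead of mandatory
from setuptools import setup, find_packages

setup(
    name="pyannote.db.plumcot",
    packages=find_packages(),
    install_requires=[
        "pyannote.database",  # core dependencies stay mandatory
    ],
    extras_require={
        # hypothetical extra for users who need forced alignment / ASR
        "asr": ["deepspeech-gpu==0.7.3"],
    },
)

Users who need it would then install with pip install "./pyannote-db-plumcot[asr]".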

fix: TWD entities annotation

Example with TheWalkingDead.Season01.Episode02

Final format

Mix-up

Here "mom" is annotated as "theodore_douglas"

referenceTo vs labelDoccano

Here "Lori" is not annotated in "labelDoccano" but is (automatically) in "referenceTo"

Also, "he" is annotated as "UNKNOWN" in "referenceTo" (and absent from "labelDoccano"), whereas in this case it is easy to see that "he" refers to "shane".

Doccano annotation (csv and json)

Mix-up

The first two lines come from nowhere: they are not in the original transcript.

That explains the mix-up described above.

Miss

I'm aware that the JSON format is not user-friendly, but here you can see that "mom" is annotated as "lori_grimes", while "lori" and "amy" are not annotated at all.

cc @sharleynelefevre

refactor: entities.py script (semi-automatic annotation)

This is also intended as documentation for the current script (more or less translated from @sharleynelefevre's explanations).
See also #13 for thoughts about a better annotation.

General stuff

  • script usage should follow the other scripts: entities.py [--uri=<uri>], defaulting to processing all series. You currently have to hard-code the series URI and season.
  • all paths should be relative to Plumcot/data, as in the other scripts; this can be done from the path of the installed Plumcot package in a few lines of code:
from pathlib import Path

import Plumcot as PC

# root of the annotation data, wherever the Plumcot package is installed
DATA_PATH = Path(PC.__file__).parent / "data"

You currently have to hard-code the input and output paths. Note that the output path is not consistent with the actual data structure and should be Plumcot/data/<serie>/csv_semi-auto-annotation (a sketch of building these paths follows this list).

  • formats should have clearer names than .csv, or, better, follow standard formats (e.g. .conll)
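
A short sketch building on the snippet above, deriving the input and output directories from DATA_PATH; the series name is illustrative and the output folder follows the structure suggested above:

from pathlib import Path

import Plumcot as PC

DATA_PATH = Path(PC.__file__).parent / "data"

# illustrative only: "TheBigBangTheory" stands for whatever --uri resolves to
serie = "TheBigBangTheory"
transcripts_dir = DATA_PATH / serie / "transcripts"
output_dir = DATA_PATH / serie / "csv_semi-auto-annotation"
output_dir.mkdir(parents=True, exist_ok=True)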

semi_auto_loc_annotation

Input

transcript with speaker names (e.g. Plumcot/data/TheBigBangTheory/transcripts/TheBigBangTheory.Season01.Episode01.txt)

Output

semi-automatically annotated transcript, in both .conll and .csv formats.

Process

removeTabLines

Input

.conll file (previous output)

Output

.conll file without trailing tabs at the end of each line, so it is suitable for Doccano
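
A minimal sketch of this step, assuming the only change is stripping trailing tabs from each line (file names are illustrative):

from pathlib import Path

def remove_trailing_tabs(conll_path: Path, output_path: Path) -> None:
    # strip trailing tabs so the .conll file can be imported into Doccano
    with open(conll_path) as src, open(output_path, "w") as dst:
        for line in src:
            dst.write(line.rstrip("\n").rstrip("\t") + "\n")

remove_trailing_tabs(Path("TheBigBangTheory.Season01.Episode01.conll"),
                     Path("TheBigBangTheory.Season01.Episode01.doccano.conll"))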

Annotation in Doccano

Note this is not in the entities script. You need to:

  • select "sequence labeling" task
  • Check the second box
  • Import the .conll
  • Annotate
  • Export the data in JSONL format (the option on the right), rename the file following the same naming convention as the episode title, and move it to the folder where the .csv / .conll outputs are located.

See CONTRIBUTING.md for annotation instructions.

Input

.conll file (previous output)

Output

.json1 files, manually annotated (i.e. corrected)

jsonToCSV

Input

.json1 file (previous output)

Output

.csv file. Note that this has nothing to do with the previous .csv format: this one only has three fields: "idSent", "idChar" and "label".
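
A sketch of the conversion, with two explicit assumptions: the Doccano export follows the classic one-JSON-object-per-line layout with "labels" as [start, end, label] triples, and "idSent" / "idChar" are taken to be the line index and the character offset of the annotated span (the actual semantics may differ):

import csv
import json

def json_to_csv(jsonl_path: str, csv_path: str) -> None:
    with open(jsonl_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["idSent", "idChar", "label"])
        for id_sent, raw in enumerate(src):
            record = json.loads(raw)
            # assumed Doccano sequence-labelling export: [start_offset, end_offset, label]
            for start, _end, label in record.get("labels", []):
                writer.writerow([id_sent, start, label])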

mergeData

Input

The two previous .csv files (the semi-automatic annotation and the one converted from the Doccano export).

Output

Yet another .csv file, which merges the two inputs. This is the format described in CONTRIBUTING.md and used in Plumcot.loader.

Process

Note this is extremely slow: roughly 15 minutes to merge two files into one (see the sketch below).
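
If the slowness comes from re-scanning one file for every row of the other, a single indexed join should bring it down to seconds. A pandas sketch, assuming both files share the "idSent" / "idChar" keys (file names are illustrative):

import pandas as pd

auto = pd.read_csv("semi_auto_annotation.csv")
manual = pd.read_csv("doccano_correction.csv")
# one indexed join instead of a nested loop over both files
merged = auto.merge(manual, on=["idSent", "idChar"], how="left",
                    suffixes=("_auto", "_manual"))
merged.to_csv("merged_annotation.csv", index=False)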

stats

Input

.csv file (previous output)

Output

Evaluation of the automatic annotation

Calling file['entity'] raises "AlignmentError"

How to reproduce?

from pyannote.database import get_protocol
protocol = get_protocol('TheBigBangTheory.SpeakerDiarization.0')
test_file = next(protocol.test())
test_file['entity']
# AlignmentError: [...] are different texts.

Is this a known issue?
Is this related to your suggested changes in #13?

Entities annotation: improvements for the future

These would prevent a lot of hacks currently present in https://github.com/PaulLerner/pyannote-db-plumcot/blob/video/Plumcot/loader/loader.py and https://github.com/PaulLerner/pyannote-db-plumcot/blob/video/scripts/ner.py :

  • don't mess with spaCy tokenization
  • save in a format that allows reconstructing the original text after tokenization (e.g. store, for each token, the whitespace that follows it, which might be ''); see the spaCy sketch after this list
  • directly add forced-alignment attributes (word timestamp + alignment confidence) when tokenizing/annotating (although it's not straightforward what the attributes of a word split into several tokens should be)
  • correct entity type: can be automatic if we annotate only person named-entities
  • differentiate entities that are referred to by their names (and thus should be detected by a typical NER system) from the others (e.g. pronouns): correcting the POS tag might be enough
  • better yet: annotate co-reference; this would make it easy to evaluate co-reference systems, but I guess it is a lot of additional work
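
Regarding the whitespace point above: spaCy already keeps the trailing whitespace of each token, so the original text can be reconstructed losslessly. A minimal sketch (the sentence is made up):

import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline works for this illustration
doc = nlp("Sheldon knocks three times.")

# store, for each token, the whitespace that follows it (may be '')
tokens = [(token.text, token.whitespace_) for token in doc]

# the original text can be rebuilt exactly from the (text, whitespace) pairs
reconstructed = "".join(text + whitespace for text, whitespace in tokens)
assert reconstructed == doc.text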

fix: BreakingBad and GameOfThrones annotations (Serial Speakers)

There might be a need to make BreakingBad and GameOfThrones annotations uniform:

  • "who speaks when" (.rttm) annotations come from the Serial Speakers dataset (Bost. et al) who did not use the same IMDb-speaker-identifier as us.
  • transcripts (.txt), forced-alignment (.rttm, .aligned) and entities (.csv) annotations come from the same pipeline as the rest of the corpus (fan transcripts + IMDb identifier for every speaker), although we don't have BreakingBad transcripts with "who said what"

Note: it might be easier to do that using the original serial speaker dataset and not the rttm version, see https://github.com/PaulLerner/Prune#convert

fix: transcripts

  • remove stage directions (didascalia)
  • remove "previously in ......."
  • in some rare cases, fans transcribed a lowercase "l" instead of "I" (e.g. "l am a TV character."). This leads to poor POS-tagging and is fixed in Plumcot.loader.CsvLoader (a sketch of the kind of substitution follows).
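
Below, a possible substitution for the lowercase-l issue; this is not necessarily how Plumcot.loader.CsvLoader actually does it:

import re

def fix_lowercase_l(text: str) -> str:
    # a standalone lowercase "l" is never a valid English word, so it is safe to
    # replace it with the pronoun "I" ("l am a TV character." -> "I am a TV character.")
    return re.sub(r"\bl\b", "I", text)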

fix: IMDb credits

Since 123e37c, a one-to-one mapping is enforced between actor and character, so that an actor can only play one character. This was done to work around the fact that actors could be credited several times in movie series (e.g. TheLordOfTheRings), which would break the episodes.py script that generates the credits.txt file from characters.txt and episodes.txt. However, it is now tricky to update the annotations made prior to 123e37c (as some names have changed), see #19.
Another solution would be to refactor the episodes.py script and the credits.txt file (I'm not sure why we have this huge unreadable matrix in the first place) so that one actor can play several characters and a character can be played by several actors. However, this implies fixing every usage of the credits.txt file... (see the illustrative sketch below).
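
For illustration, a credits structure allowing many-to-many actor/character relations could look like the sketch below; the names are made up and this does not reflect the current credits.txt matrix:

# hypothetical many-to-many credits mapping (illustrative names only):
# one actor may play several characters, one character may be played by several actors
credits = {
    "TheLordOfTheRings.Episode01": {
        "actor_a": ["character_x", "character_y"],
        "actor_b": ["character_x"],
    },
}

# inverting the mapping gives characters -> actors in a few lines
characters = {}
for actors in credits.values():
    for actor, chars in actors.items():
        for char in chars:
            characters.setdefault(char, set()).add(actor)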

We could then:

  • close #19

  • revert:

  • fix TheLordOfTheRings, either:

    • revert 2e94797 and update the names in the annotations (.aligned and .rttm)
    • or keep the names from 123e37c
  • same for BuffyTheVampireSlayer, but it's a lot easier: the only name that has changed is janice → janice_penshaw (which is nonexistent as far as I know)

  • update README and CONTRIBUTING
