lingjzhu / charsiu

Charsiu: A neural phonetic aligner.

License: MIT License

Jupyter Notebook 77.90% Python 22.10%

charsiu's Issues

How does the type of audio affect the predictive Charsiu aligner? The audio is in WAV format.

Possible to create lab files using forced alignment?

Hello,

Thanks for releasing and supporting your model. Would it be possible to add support for exporting .lab files?
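
As a rough illustration of what such an export could look like, the sketch below formats (start, end, label) tuples, in the same shape as the aligner output quoted elsewhere on this page, as label-file lines. The names `to_lab_lines` and `write_lab` are hypothetical and not part of charsiu. Writing times as integers in 100 ns units follows the HTK convention, which is an assumption about the target tool; some tools expect plain seconds instead.

```python
def to_lab_lines(segments, htk_units=True):
    """Format (start_sec, end_sec, label) tuples as .lab lines.

    htk_units=True writes times as integers in 100 ns units (HTK convention);
    htk_units=False writes plain seconds with two decimals.
    Which one is right depends on the consuming tool (assumption).
    """
    lines = []
    for start, end, label in segments:
        start, end = float(start), float(end)  # aligner output may hold strings
        if htk_units:
            lines.append(f"{round(start * 1e7)} {round(end * 1e7)} {label}")
        else:
            lines.append(f"{start:.2f} {end:.2f} {label}")
    return lines


def write_lab(segments, path, htk_units=True):
    """Write one segment per line to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(to_lab_lines(segments, htk_units)) + "\n")
```

Calling `write_lab(phone_segments, "utt.lab")` on the phone-level output of an aligner would then produce one file per utterance.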

Does it support Chinese-English alignment?

Unable to find the model in the transformers package. Any thoughts?

Reproducing Results for W2V2-FS-20ms

Based on the paper, I've successfully reproduced results for Charsiu's FC-10ms, textless FC-10ms, MFA, WebMaus, but I'm having trouble reproducing the pretrained FS-20ms model.
I first downloaded the charsiu/en_w2v2_fs_10ms from HuggingFace into my working directory.
Then I followed the tutorial for generating alignments. When I try

charsiu = charsiu_forced_aligner(aligner='charsiu/en_w2v2_fs_10ms')

the results are complete gibberish and nowhere near those in the paper.

When I try

charsiu = charsiu_attention_aligner(aligner='charsiu/en_w2v2_fs_10ms')

the results are slightly better, but still not as good as the paper's.

My questions are:

Which one of the above lines should I be using when calling the fs_10ms aligner?
Is there perhaps a step I'm missing after downloading the model from HuggingFace?

Thanks so much!

Unable to run Chinese forced alignment model

May I ask why there is always an index error when I run the Chinese forced alignment model?
The snapshot is as follows. Thank you very much!

About training

experiments: "Those were original research code for training the model."
Good job. I want to pre-train on my Chinese dataset, but I don't know whether the code in experiments is usable as-is. If it is, could you write up rough training instructions?

What each aligner does

I keep forgetting which aligner does what so here it is:

Wav2Vec2ForFrameClassification is just w2v2 with a linear layer head.
charsiu_predictive_aligner takes the argmax of the logits of the Wav2Vec2ForFrameClassification model.
charsiu_forced_aligner does g2p on a given transcript, then uses that sequence of phones to index into the logits of Wav2Vec2ForFrameClassification along the phone_id axis. DTW can then be used on the resulting sequence-vs-time tensor to find the alignment.

charsiu/src/utils.py

Lines 304 to 305 in 13a69f2

D,align = dtw(C=-cost[:,phone_ids],
              step_sizes_sigma=np.array([[1, 1], [1, 0]]))
charsiu_attention_aligner uses Wav2Vec2ForAttentionAlignment, which uses w2v2 to encode speech and a BERT to encode phonemes, followed by something really over-engineered. DTW is the correct way to normalize the output of w2v2, and it seems that Wav2Vec2ForAttentionAlignment only exists because DTW was overlooked. Should this be deprecated?
charsiu_chain_forced_aligner does w2v2-c2c to get phonemes, then Wav2Vec2ForAttentionAlignment followed by DTW. Perhaps this should be replaced by the charsiu_forced_aligner where the phonemes are obtained from w2v2-c2c.
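
To make the difference between the predictive and forced aligners concrete, here is a minimal NumPy sketch. It is not the repository's code: `predictive_align` and `forced_align` are hypothetical names, and the dynamic program is hand-rolled rather than a call to a DTW library. It maximizes the summed scores, which is equivalent to minimizing the negated scores passed as the cost in the snippet above, and it only allows "stay on the same phone" or "advance to the next phone" at each frame, mirroring the step pattern [[1, 1], [1, 0]].

```python
import numpy as np


def predictive_align(logits):
    """Textless alignment: most likely phone id for every frame."""
    return logits.argmax(axis=-1)


def forced_align(logits, phone_ids):
    """Monotonic alignment of T frames to a known sequence of N phones.

    Returns an array of length T holding, per frame, an index into phone_ids.
    """
    scores = logits[:, phone_ids]            # (T, N) frame-by-phone scores
    T, N = scores.shape
    if N > T:
        raise ValueError("more phones than frames")
    best = np.full((T, N), -np.inf)          # best cumulative score per cell
    back = np.zeros((T, N), dtype=int)       # 1 if the cell was reached by advancing
    best[0, 0] = scores[0, 0]
    for t in range(1, T):
        for n in range(min(t + 1, N)):       # phone n needs at least n earlier frames
            stay = best[t - 1, n]
            advance = best[t - 1, n - 1] if n > 0 else -np.inf
            back[t, n] = int(advance > stay)
            best[t, n] = max(stay, advance) + scores[t, n]
    # Backtrack from the last frame and last phone to the first.
    path = np.zeros(T, dtype=int)
    n = N - 1
    for t in range(T - 1, -1, -1):
        path[t] = n
        n -= back[t, n]
    return path
```

The predictive variant can emit any phone at any frame, while the forced variant can only emit the transcript's phones in order, which is why it needs the g2p output first.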

ETA German version?

I would love to give this a try ...

Bug in phoneme to word conversion -- duplicate words

Something seems to be not right with how SIL is used in the word transcriptions.

This is the first example in the LibriSpeech Test set.

Here is the true transcript:

HE BEGAN A CONFUSED COMPLAINT AGAINST THE WIZARD WHO HAD VANISHED BEHIND THE CURTAIN ON THE LEFT

Here is the forced aligned word transcript:

array([['0.0', '0.23', '[SIL]'],
       ['0.23', '0.33', 'he'],
       ['0.33', '0.65', 'began'],
       ['0.65', '0.69', 'a'],
       ['0.69', '1.21', 'confused'],
       ['1.21', '1.62', 'complaint'],
       ['1.62', '1.93', 'against'],
       ['1.93', '2.01', 'the'],
       ['2.01', '2.41', 'wizard'],
       ['2.41', '2.56', '[SIL]'],
       ['2.56', '2.57', 'wizard'],
       ['2.57', '2.63', '[SIL]'],
       ['2.63', '2.75', 'who'],
       ['2.75', '2.84', 'had'],
       ['2.84', '3.26', 'vanished'],
       ['3.26', '3.59', 'behind'],
       ['3.59', '3.66', 'the'],
       ['3.66', '4.02', 'curtain'],
       ['4.02', '4.15', 'on'],
       ['4.15', '4.23', 'the'],
       ['4.23', '4.66', 'left'],
       ['4.66', '4.89', '[SIL]']], dtype='<U32')

Here is the forced aligned phonetic transcript:

array([['0.0', '0.23', '[SIL]'],
       ['0.23', '0.3', 'HH'],
       ['0.3', '0.33', 'IY'],
       ['0.33', '0.39', 'B'],
       ['0.39', '0.44', 'IH'],
       ['0.44', '0.53', 'G'],
       ['0.53', '0.6', 'AE'],
       ['0.6', '0.65', 'N'],
       ['0.65', '0.69', 'AH'],
       ['0.69', '0.77', 'K'],
       ['0.77', '0.81', 'AH'],
       ['0.81', '0.86', 'N'],
       ['0.86', '0.97', 'F'],
       ['0.97', '1.02', 'Y'],
       ['1.02', '1.1', 'UW'],
       ['1.1', '1.16', 'Z'],
       ['1.16', '1.21', 'D'],
       ['1.21', '1.26', 'K'],
       ['1.26', '1.3', 'AH'],
       ['1.3', '1.34', 'M'],
       ['1.34', '1.44', 'P'],
       ['1.44', '1.49', 'L'],
       ['1.49', '1.55', 'EY'],
       ['1.55', '1.58', 'N'],
       ['1.58', '1.62', 'T'],
       ['1.62', '1.66', 'AH'],
       ['1.66', '1.74', 'G'],
       ['1.74', '1.78', 'EH'],
       ['1.78', '1.84', 'N'],
       ['1.84', '1.9', 'S'],
       ['1.9', '1.93', 'T'],
       ['1.93', '1.96', 'DH'],
       ['1.96', '2.01', 'AH'],
       ['2.01', '2.1', 'W'],
       ['2.1', '2.15', 'IH'],
       ['2.15', '2.26', 'Z'],
       ['2.26', '2.34', 'ER'],
       ['2.34', '2.41', 'D'],
       ['2.41', '2.56', '[SIL]'],
       ['2.56', '2.57', 'D'],
       ['2.57', '2.63', '[SIL]'],
       ['2.63', '2.7', 'HH'],
       ['2.7', '2.75', 'UW'],
       ['2.75', '2.78', 'HH'],
       ['2.78', '2.8', 'AE'],
       ['2.8', '2.84', 'D'],
       ['2.84', '2.95', 'V'],
       ['2.95', '3.04', 'AE'],
       ['3.04', '3.09', 'N'],
       ['3.09', '3.15', 'IH'],
       ['3.15', '3.23', 'SH'],
       ['3.23', '3.26', 'T'],
       ['3.26', '3.3', 'B'],
       ['3.3', '3.35', 'IH'],
       ['3.35', '3.43', 'HH'],
       ['3.43', '3.53', 'AY'],
       ['3.53', '3.56', 'N'],
       ['3.56', '3.59', 'D'],
       ['3.59', '3.62', 'DH'],
       ['3.62', '3.66', 'AH'],
       ['3.66', '3.78', 'K'],
       ['3.78', '3.9', 'ER'],
       ['3.9', '3.93', 'T'],
       ['3.93', '3.96', 'AH'],
       ['3.96', '4.02', 'N'],
       ['4.02', '4.09', 'AA'],
       ['4.09', '4.15', 'N'],
       ['4.15', '4.19', 'DH'],
       ['4.19', '4.23', 'AH'],
       ['4.23', '4.36', 'L'],
       ['4.36', '4.47', 'EH'],
       ['4.47', '4.58', 'F'],
       ['4.58', '4.66', 'T'],
       ['4.66', '4.89', '[SIL]']], dtype='<U32')

I suspect this may indicate a general problem with the phoneme to word conversion.
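
A quick way to check other utterances for the same symptom is to drop the silence entries from the word-level output and compare what remains with the transcript. The sketch below assumes the output is a sequence of (start, end, word) triples as shown above; `first_mismatch` is a hypothetical helper, not part of charsiu.

```python
def first_mismatch(word_segments, transcript, sil="[SIL]"):
    """Return the index of the first aligned word that differs from the
    transcript (ignoring silence), or None if the two sequences agree."""
    aligned = [word for _, _, word in word_segments if word != sil]
    expected = transcript.lower().split()
    for i, (a, e) in enumerate(zip(aligned, expected)):
        if a != e:
            return i
    if len(aligned) != len(expected):
        # One sequence is a strict prefix of the other.
        return min(len(aligned), len(expected))
    return None
```

On the example above, the second 'wizard' sits where 'who' is expected, so the function returns the position of that duplicate.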

TextGrid file isn't according to spec

Could you check whether the generated TextGrid file conforms to the spec?

I'm using https://github.com/nltk/nltk_contrib/blob/95d1806e2f4e89e960b76a685b1fba2eaa7d5142/nltk_contrib/textgrid.py to test the generated TextGrid files.
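
For comparison, here is a minimal sketch of a writer for Praat's long ("full text") TextGrid layout, which could be diffed against the file charsiu produces. It is not charsiu's serializer: `textgrid_text` is a hypothetical name, and the sketch assumes every tier is an IntervalTier with sorted, contiguous intervals that cover the same time span.

```python
def textgrid_text(tiers):
    """Render {tier_name: [(start, end, label), ...]} as a long-format TextGrid string."""
    xmin = min(float(segs[0][0]) for segs in tiers.values())
    xmax = max(float(segs[-1][1]) for segs in tiers.values())
    out = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        f"size = {len(tiers)}",
        "item []:",
    ]
    for i, (name, segs) in enumerate(tiers.items(), start=1):
        out += [
            f"    item [{i}]:",
            '        class = "IntervalTier"',
            f'        name = "{name}"',
            f"        xmin = {xmin}",
            f"        xmax = {xmax}",
            f"        intervals: size = {len(segs)}",
        ]
        for j, (start, end, label) in enumerate(segs, start=1):
            text = str(label).replace('"', '""')  # embedded quotes are doubled
            out += [
                f"        intervals [{j}]:",
                f"            xmin = {float(start)}",
                f"            xmax = {float(end)}",
                f'            text = "{text}"',
            ]
    return "\n".join(out) + "\n"
```

Writing the returned string to a file and loading it with the parser linked above (or with Praat itself) gives a reference point for spotting where the generated file deviates.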

Long audio not currently supported?

“Charsiu works the best when your files are shorter than 15 ms. Test whether your files are longer than 15ms”

I saw this hint in the description and tested it.

When I run forced alignment on long audio, the following error appears:

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    charsiu.align(audio=audio, text=text)
  File "E:\***/python/charsiu/charsiu/src\Charsiu.py", line 157, in align
    pred_words = self.charsiu_processor.align_words(pred_phones,phones,words)
  File "E:\***/python/charsiu/charsiu/src\processors.py", line 417, in align_words
    word_dur.append((dur,words_rep[count])) #((start,end,phone),word)
IndexError: list index out of range
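
Before aligning, it can help to check how long a clip is and where it would have to be cut to stay under a chosen limit. The sketch below is only arithmetic on sample counts; `duration_seconds` and `split_points` are hypothetical helpers, and the limit is left as a caller-supplied parameter rather than taken from the quoted hint. Cutting the audio only helps forced alignment if the transcript can be split at the same points, which this sketch does not attempt.

```python
def duration_seconds(num_samples, sample_rate):
    """Length of a clip in seconds."""
    return num_samples / float(sample_rate)


def split_points(num_samples, sample_rate, max_seconds):
    """Return (start, end) sample indices of consecutive chunks,
    each no longer than max_seconds."""
    step = int(max_seconds * sample_rate)
    if step <= 0:
        raise ValueError("max_seconds is too small for this sample rate")
    return [(s, min(s + step, num_samples)) for s in range(0, num_samples, step)]
```

For a NumPy waveform `audio` sampled at 16000 Hz, `duration_seconds(len(audio), 16000)` gives its length, and each `(start, end)` pair can be used as `audio[start:end]`.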

How do we use the pretrained attention aligner?

Hi, I find that getting a pretrained predictive aligner (aligner='charsiu/en_w2v2_fc_10ms') to work with LibriSpeech is straightforward. However, I'm unable to get the attention aligner working. How do I initialize the aligner, and how do I get the corresponding BERT config to go with it? It keeps throwing an error.

wav2vec2-fs model for chinese alignment?

Hello, from your paper, it seems that W2V2-FS aligns better than W2V2-FC, but currently only an English W2V2-FS model is available. Have you tested W2V2-FS alignment on Chinese? If not, I'd like to train one and test it.
I would also like to know the specific training steps. I have read your training code, but I don't know which dataset I should use.
I'd like to use W2V2-FS to replace MFA.
(Sorry for my bad English.)

Possible fixes from fork

Thanks for this great package! I forked the repo to tweak a few things to help my use case, and some of them might be useful to merge back into the master branch. I haven't submitted a PR because some of them might not be appropriate/desirable to merge, so I figured you could tell me which ones you want and I could clean up the code/add some tests if necessary and submit a PR then.

Fork is at https://github.com/nmfisher/charsiu

Changes are:

Don't require the sampling rate to be provided explicitly, as librosa can resample to 16000 Hz when loading a file.
Reinstate punctuation and insert the punctuation token, rather than silence, into the phone list.
Downweight silence to minimize erroneous insertion of silence in the middle of a word (this should probably be a parameter rather than a hardcoded 0.1).
Ignore silence where the left and right phones are identical (to completely avoid inserting silence frames in the middle of consecutive frames for a single phone). This works for me right now but needs a bit more thought: if phones are intentionally repeated (e.g. "ai ai"), this folds the silence between them into the left phone, so "ai [SIL] ai" always becomes "ai ai". The solution is probably to pass a parameter for a minimum silence duration (if the silence is longer than X, it is preserved; otherwise it is folded into the left phone).
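
The minimum-silence-duration idea from the last item could look roughly like the sketch below. It is not the fork's code: `fold_short_silences` is a hypothetical name, it works on finished (start, end, label) segments with numeric times rather than on frames, and it merges the left phone, the short silence, and the identical right phone into one segment, on the assumption that a short silence between identical phones split a single phone.

```python
def fold_short_silences(segments, min_silence, sil="[SIL]"):
    """Merge 'X [SIL] X' into one 'X' when the silence is shorter than
    min_silence seconds; longer silences are kept as they are."""
    out = []
    i = 0
    while i < len(segments):
        start, end, label = segments[i]
        short_sil = label == sil and (end - start) < min_silence
        has_neighbours = bool(out) and i + 1 < len(segments)
        if short_sil and has_neighbours and out[-1][2] == segments[i + 1][2]:
            # Extend the left phone across the silence and the repeated phone.
            out[-1] = (out[-1][0], segments[i + 1][1], out[-1][2])
            i += 2
            continue
        out.append((start, end, label))
        i += 1
    return out
```

With a threshold above the silence length, 'D [SIL] D' collapses to a single 'D'; with a threshold below it, an intentional pause such as "ai [SIL] ai" is left alone.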

Scripts to reproduce results from paper?

Would you have the scripts to reproduce the results from the papers (I'm particularly interested in table 2), or maybe the procedures to reproduce them from this repo?

What is the phonetic table used for Chinese?

I guess it's pinyin, is there an official introduction?
