
Comments (11)

kormoczi commented on August 12, 2024

Two more things...

  1. I tried to get the IPA phonemes of a Hungarian sentence with espeak-ng, but it produces different IPA characters as well. Do you have a chart or conversion table for this? (I could add the additional characters to the language's inventory, but I assume that would cause a lot of confusion...)
  2. In the text files (for fine-tuning), do we need to include spaces between the words, or any other special signs (e.g. ,.?!)?
    Thanks

from allosaurus.

xinjli commented on August 12, 2024

For the phonemes, you do not need to use that exact phoneme inventory. Most of them are a standard phoneme plus some diacritics attached to it. For the diacritics, you can find info, for example, here

You can use a much simpler phoneme inventory if that satisfies your purpose. Actually, I think that the default phoneme inventory is hard to recognize.

I am not very familiar with espeak-ng's inventory. If it is in X-SAMPA format, you can convert it using this file from panphon

For fine-tuning, you should only include phonemes separated by spaces. Do not use other special signs, as they might be interpreted as phonemes.
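As an illustration of the "phonemes separated by spaces, no special signs" advice, here is a minimal sketch of a cleanup step for transcription lines. It is plain Python and does not use allosaurus; the function name `clean_phone_line` is hypothetical. It drops characters in Unicode punctuation categories, which leaves IPA length marks and combining diacritics untouched.

```python
import unicodedata


def clean_phone_line(line: str) -> str:
    """Return the phone tokens of `line` joined by single spaces,
    with punctuation such as , . ? ! removed (hypothetical helper)."""
    kept = []
    for token in line.split():
        # Drop characters whose Unicode category is punctuation (P*).
        # IPA modifiers like the length mark and tie bar are not in P*.
        stripped = "".join(
            ch for ch in token
            if not unicodedata.category(ch).startswith("P")
        )
        if stripped:
            kept.append(stripped)
    return " ".join(kept)


print(clean_phone_line("t͡s i t͡s ɒ ."))
```

Note that an ASCII colon counts as punctuation here, so a length written as `o:` would lose its colon; the IPA length mark `ː` is kept.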

kormoczi commented on August 12, 2024

Thanks for the answer, @xinjli. I have started to check the links that you mentioned.

I would like to use a simple phoneme inventory, of course, if that is possible, but I still have a lot of questions regarding the actual phoneme inventory (I think I understand the diacritics, so that part is not a question).

Let me give you one example!
The Hungarian word cica (meaning: cat) should, I think, be transcribed in IPA as: t͡s i t͡s ɒ
translate_tts_hu_cica.zip
At the moment, the following command:
python3 -m allosaurus.run --lang hun -i translate_tts_hu_cica.wav
gives the following result: t i z ɒ
Which is not good, but since t͡s is not in the inventory, it is more or less understandable...
So let's modify the Hungarian inventory. I went through the process
(allosaurus.bin.write_phone, add one new line for t͡s, allosaurus.bin.update_phone, I even checked the inventory),
but the result is still: t i z ɒ.
The really interesting part is that if I do not specify the language, and just run the following command:
python3 -m allosaurus.run -i translate_tts_hu_cica.wav
the result (very surprisingly for me) is: tɕ i tʂ ɒ,
which is still not perfect, but much closer to the correct result.

I have even tried the topk parameter, but even with topk=5, t͡s does not appear in the result...
Do you have any suggestions? Am I doing something wrong?

(By the way, I have checked Phoible, and according to [https://phoible.org/languages/hung1274], all the different inventories for Hungarian have this ts phoneme (either t͡s, or ts, or t̪s̪).)

Regarding the fine-tuning: you mentioned I should not use special signs.
Does that mean I should put, for example, 'z' into the text file, and not 'zː', 'z̪' or 'z̻'?

Thank you and best regards!

xinjli commented on August 12, 2024

It might be that the likelihood ordering is tɕ >= t >= t͡s. This is typically caused by the unbalanced training set I used when I trained the model.
You might want to suppress t, or even delete it from your inventory if you do not want it.
You can check the prior customization part of the README; it allows you to suppress some phones and boost others.

For the special signs, you can use 'zː', 'z̪' or 'z̻' as long as they are valid IPA.
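The suppress/boost idea can be sketched as a small helper that builds the contents of a prior file. This assumes the file holds one phone and one numeric adjustment per line, which is my reading of the README's prior customization section; the helper name `format_prior` and the scores below are hypothetical, so please verify the format and sensible score ranges against the README before relying on it.

```python
from typing import Mapping


def format_prior(adjustments: Mapping[str, float]) -> str:
    """Build prior-file text: one '<phone> <score>' pair per line.

    Assumed convention (check the README): a negative score suppresses
    a phone, a positive score boosts it.
    """
    return "".join(f"{phone} {score}\n" for phone, score in adjustments.items())


# Hypothetical values: suppress t, boost t͡s.
prior_text = format_prior({"t": -5.0, "t͡s": 5.0})
print(prior_text, end="")
```

The resulting text would be saved to a file and passed to the recognizer as described in that README section.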

kormoczi commented on August 12, 2024

Thanks for the explanation.
At first, I did not want to touch the probabilities, as I do not know how that might affect other words...

So I will prepare the datasets for model fine-tuning. I have read in the README that the audio files should be shorter than 10 seconds. But can you tell me which is better: to have only one word in each audio file, or to have complete (short) sentences? And is there a need for a short silence at the beginning and the end of the audio files or not?

And may I ask what kind of dataset you are using for the training? Does it contain samples from all the languages? If there are Hungarian samples in it, is it possible to check them?

Thanks!

xinjli commented on August 12, 2024

I think both styles are possible (one word per file, or one short sentence per file). It depends on your final application, so you can use whatever you think is appropriate. There is no need for silence at the beginning for training.

The data other than English was mainly from a corpus collection called the Babel dataset, which is a telephone conversation corpus. You can see the list of corpora in the linked paper. The model available here is not using the exact same corpus set, but it is very similar. There were no Hungarian samples when I trained it.

kormoczi commented on August 12, 2024

I have started the fine-tuning...
I have customized the phoneme inventory, and prepared the training and validation datasets.
The audio features look fine (so far), but I have a problem with the text features.
In my dataset, there are texts containing long consonants, like "akkor" (IPA transcription: ɑ kː o r).
The phoneme k is in the inventory; kː is not, of course, because it is not a different phoneme, just a long version.
But the text feature script raises an AssertionError because of this. (And this is the same for the other long consonants as well.)
The long vowels work fine (so far), because they are listed as different phonemes in the inventory as well (like o and oː).

So what should I do? Do I have to put the long consonants into the phoneme inventory as well, or something else?
Thanks!
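To find every token that would trip this kind of check before running feature extraction, a minimal sketch like the following can list the phones in the transcripts that are missing from the inventory. It is plain Python and does not call allosaurus; the function name `missing_phones` and the tiny inventory are hypothetical stand-ins for the real files.

```python
from collections import Counter
from typing import Iterable


def missing_phones(inventory: Iterable[str], transcripts: Iterable[str]) -> Counter:
    """Count space-separated tokens in `transcripts` that are not in `inventory`."""
    known = set(inventory)
    counts = Counter()
    for line in transcripts:
        for phone in line.split():
            if phone not in known:
                counts[phone] += 1
    return counts


# Hypothetical inventory without the long consonant.
inventory = ["ɑ", "k", "o", "r"]
print(missing_phones(inventory, ["ɑ kː o r"]))
```

Each reported token then either gets its own line in the inventory or is rewritten in the transcripts.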

xinjli commented on August 12, 2024

I might be wrong, but as far as I know, only vowels can have this "long version". In your case, k itself is a very short consonant and probably should not have a long version; it seems more reasonable for it to be something like k o: r.

If you still want to distinguish them, you can treat them as two different phonemes (o, o:) and train on that.

dmort27 commented on August 12, 2024

This is not correct. There are two ways of transcribing a geminate (or long) consonant: [kk] or [kː]. The first is ambiguous, since it can represent a sequence of two [k]s or a long counterpart of [k].

xinjli commented on August 12, 2024

So maybe we can decompose [k:] into [k] [k] in this case?

kormoczi commented on August 12, 2024

I don't think we can decompose [k:] into [k] [k].
Anyhow, long consonants are common in Hungarian, so I will try to use two different phonemes for the short and the long versions, and will check the result...
