
epitran's Issues

backoff hanging at non-linguistic input

Just testing this out for fun... one thing that I notice is that the backoff feature seems to hang if it gets some input that isn't in its alphabets, e.g.

epi = epitran.Epitran('tur-Latn')
epi.transliterate('merhaba: nasilsin?!')
'meɾhaba: nasilsin?!'

works pretty quickly (and would work even better if I had used an actual Turkish keyboard)

Whereas

backoff = Backoff(['tur-Latn', 'hin-Deva'])
backoff.transliterate('merhaba: nasilsin?!')

seems to hang indefinitely...

Maybe backoff needs an extra something for pass-throughs? :)
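In the meantime, a caller-side workaround might be to tokenize the input and only hand alphabetic tokens to the backoff object, passing everything else through unchanged. This is only a sketch: transliterate_with_passthrough is hypothetical (not part of epitran) and takes the transliteration function as an argument.

```python
import re

# Split the input into runs of word characters and runs of everything else.
TOKEN_RE = re.compile(r'\w+|\W+')

def transliterate_with_passthrough(translit, text):
    """Apply `translit` only to alphabetic tokens; punctuation,
    digits, and whitespace pass through untouched."""
    out = []
    for token in TOKEN_RE.findall(text):
        if token.isalpha():
            out.append(translit(token))
        else:
            out.append(token)  # pass-through for non-linguistic material
    return ''.join(out)

# Toy demonstration with str.upper standing in for backoff.transliterate:
transliterate_with_passthrough(str.upper, 'merhaba: nasilsin?!')
# → 'MERHABA: NASILSIN?!'
```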

Function xsampa_list() in _epitran.py deletes things a lot

For instance in cebuano

felix --> [e, l, i]
x --> []

In swedish

och --> []

I fixed this (I think) by replacing the commented-out line below with the uncommented one. Maybe this is horribly wrong, but it seems to work now.

# ipa_segs = self.ft.ipa_segs(self.epi.strict_trans(word, normpunc, ligaturize))
ipa_segs = self.ft.segs_safe(self.epi.transliterate(word, normpunc, ligaturize))

EpihanTraditional Class does not have regexp attribute

When running Backoff class with "cmn-Hant" (which uses EpihanTraditional), it complains with the following error:

File "/usr2/home/amuis/anaconda3/envs/py36/lib/python3.6/site-packages/epitran/backoff.py", line 46, in transliterate
    m = lang.epi.regexp.match(dia.process(token))
AttributeError: 'EpihanTraditional' object has no attribute 'regexp'

This can be easily fixed by adding the following line at the end of https://github.com/dmort27/epitran/blob/master/epitran/epihan.py

self.regexp = re.compile(r'\p{Han}')

I assume the character class "\p{Han}" captures both traditional and simplified Chinese.
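One caveat: the standard-library re module does not support \p{Han}; that syntax needs the third-party regex package, which epitran depends on. As a rough stdlib-only sketch (the explicit code-point ranges are an assumption covering only the main CJK blocks), one can confirm that a single range matches both traditional and simplified characters:

```python
import re

# Partial Han range: CJK Unified Ideographs plus Extension A.
# These blocks contain both simplified and traditional characters.
han_re = re.compile(r'[\u3400-\u4DBF\u4E00-\u9FFF]')

print(bool(han_re.match('漢')))  # True (traditional)
print(bool(han_re.match('汉')))  # True (simplified)
```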

attempt to transliterate English not working, despite flite being installed properly

I've been trying to use the english transliteration, without success.

I did follow the installation instructions for flite (and also copied the relevant binaries into /usr/local/bin), and the process seems to have worked, since I no longer get the "lex_lookup not installed" kind of error.

However, I'm still stuck on a rather cryptic KeyError. When I do (in Python 3):

import epitran
epi = epitran.Epitran('eng-Latn')
epi.transliterate('Berkeley')

this is what I get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-e30894fd177f> in <module>
----> 1 epi.transliterate('Berkeley')

~/.local/lib/python3.7/site-packages/epitran/_epitran.py in transliterate(self, word, normpunc, ligatures)
     60             unicode: IPA string
     61         """
---> 62         return self.epi.transliterate(word, normpunc, ligatures)
     63 
     64     def reverse_transliterate(self, ipa):

~/.local/lib/python3.7/site-packages/epitran/flite.py in transliterate(self, text, normpunc, ligatures)
     89         for chunk in self.chunk_re.findall(text):
     90             if self.letter_re.match(chunk):
---> 91                 acc.append(self.english_g2p(chunk))
     92             else:
     93                 acc.append(chunk)

~/.local/lib/python3.7/site-packages/epitran/flite.py in english_g2p(self, text)
    205             logging.warning('Non-zero exit status from lex_lookup.')
    206             arpa_text = ''
--> 207         return self.arpa_to_ipa(arpa_text)

~/.local/lib/python3.7/site-packages/epitran/flite.py in arpa_to_ipa(self, arpa_text, ligatures)
     73         arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
     74         ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
---> 75         text = ''.join(ipa_list)
     76         return text
     77 

~/.local/lib/python3.7/site-packages/epitran/flite.py in <lambda>(d)
     72         arpa_list = self.arpa_text_to_list(arpa_text)
     73         arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
---> 74         ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
     75         text = ''.join(ipa_list)
     76         return text

KeyError: 'iy)\n(b'

No matter what I query, self.arpa_map never seems to contain the key. What am I doing wrong?
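For what it's worth, the failing key 'iy)\n(b' suggests the raw lex_lookup output still contains parentheses and newlines when it reaches arpa_to_ipa. A hypothetical cleanup pass, assuming output shaped like '(b er1 k l iy0)' per line (clean_arpa is not part of epitran), would look like:

```python
import re

def clean_arpa(raw):
    """Strip parentheses, newlines, and stress digits from raw
    lex_lookup-style output, returning bare ARPABET symbols."""
    raw = raw.replace('(', ' ').replace(')', ' ')
    symbols = [re.sub(r'\d', '', s) for s in raw.split()]
    return [s for s in symbols if s]

clean_arpa('(b er1 k l iy0)\n(l iy0)')
# → ['b', 'er', 'k', 'l', 'iy', 'l', 'iy']
```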

epitran.exceptions.DatafileError: Header is ["Prth", "Phon"] instead of ["Orth", "Phon"].

I'm trying to use epitran to obtain the correct phonetic pronunciations of French words. I did get it working eventually through the use of the fra-Latn preprocessor, but its performance is lackluster. It seems to give me very literal transliterations, ones that never use the uvular "ʁ" sound or mark syllable boundaries with ".":

  • "acteur" ("actor") comes out as "atyr" (should be "ak.tœʁ")
  • "actrice" ("actress") comes out as "aktriz" (should be "ak.tʁis")
  • "chat" ("cat") comes out as "ʃa", which is correct, but at least one time when I tried it I got trailing symbols, like "ʃat"
  • "chien" ("dog") comes out as "ʃjâ" when it should be "ʃjɛ̃"

So after having mixed performance with that, I looked at the documentation and noticed there was a more phonetic translator "fra-Latn-np". Upon attempting to use this to translate any given word, I get the following error:

Traceback (most recent call last):
  File "main.py", line 6, in <module>
    epi = epitran.Epitran('fra-Latn-np')
  File "/home/callum/.local/share/virtualenvs/first625-xxpZk1TH/lib/python3.8/site-packages/epitran/_epitran.py", line 46, in __init__
    self.epi = SimpleEpitran(code, preproc, postproc, ligatures, rev, rev_preproc, rev_postproc, tones=tones)
  File "/home/callum/.local/share/virtualenvs/first625-xxpZk1TH/lib/python3.8/site-packages/epitran/simple.py", line 43, in __init__
    self.g2p = self._load_g2p_map(code, False)
  File "/home/callum/.local/share/virtualenvs/first625-xxpZk1TH/lib/python3.8/site-packages/epitran/simple.py", line 100, in _load_g2p_map
    raise DatafileError('Header is ["{}", "{}"] instead of ["Orth", "Phon"].'.format(orth, phon))
epitran.exceptions.DatafileError: Header is ["Prth", "Phon"] instead of ["Orth", "Phon"].

I'm not sure what causes it, but looking in that directory there is also an undocumented "fra-Latn-p" preprocessor, which sometimes does better and sometimes worse. Could you please explain what is going on here?

Here is my code:

import sys
from google_trans_new import google_translator
import epitran

translator = google_translator()
epi = epitran.Epitran('fra-Latn-np')

# Translate the first system argument
#translated_text = translator.translate(sys.argv[1], lang_src='en', lang_tgt='fr')
# Get the IPA pronunciation
#ipa_symbols = epi.transliterate(translated_text)

#print(translated_text)
#print(ipa_symbols)
print(epi.transliterate(sys.argv[1]))

Does context matter?

Should I feed individual words to Epitran, or a whole sentence?
Are there any context-sensitive rules used to transliterate the words?

German transliteration issues

Hello,

I came across what I believe to be a bug in German transliteration of the grapheme 's'. This occurs when using the 'deu-Latn' and the 'deu-Latn-nar' dictionaries. Take for example the word 'sehr':

In [14]: epi1.transliterate('sehr')
Out[14]: 't͡seːə'
In [16]: epi3.transliterate('sehr')
Out[16]: 't͡seːɐ'

Here epi1 was initialized with the 'deu-Latn' dictionary and epi3 with the 'deu-Latn-nar' dictionary.

In both cases I would expect the 's' in 'sehr' to be transliterated with [z]. I know that [s] is also possible in this case when dealing with southern German dialects, and I see this transliteration when using the 'deu-Latn-np' dictionary. However, after consulting all my sources, I don't see a case where this can be transliterated as [t͡s].

Another example would be the word 'Stock':

In [20]: epi1.transliterate('Stock')
Out[20]: 'stok'
In [21]: epi3.transliterate('Stock')
Out[21]: 'stok'

In the case of the 'deu-Latn' example, I can understand why this may be transliterated as [s], but at least with the narrow transliteration I would expect [ʃ]. As far as I know, [s] only occurs in this environment in northern German dialects.

Would you mind investigating this with me? What I've done so far is look at tens of examples (I'm transliterating a large corpus), and it seems to happen across the board, no exceptions. I also made sure that I pip-installed the latest version of Epitran.

Incorrect grapheme-phoneme alignment in word_to_tuple response

Thank you for this great tool!
I was hoping to use Epitran to extract frequencies of grapheme-phoneme alignment in different languages. But I am running into issues when using the word_to_tuples and word_to_segs features.

Here is the output of epi.word_to_tuples for the word tough in English

('L', 0, 't', 't', [('t', <map object at 0x113817c50>)])
('L', 0, 'o', 'ʌ', [('ʌ', <map object at 0x113817250>)])
('L', 0, 'u', 'f', [('f', <map object at 0x1120a06d0>)])
('L', 0, 'g', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('L', 0, 'h', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])

Here is the output for choice

('L', 0, 'c', 't͡ʃ', [('t͡ʃ', <map object at 0x11380cad0>)])
('L', 0, 'h', 'o', [('o', <map object at 0x11380c5d0>)])
('L', 0, 'o', 'j', [('j', <map object at 0x11380cb10>)])
('L', 0, 'i', 's', [('s', <map object at 0x1120a0fd0>)])
('L', 0, 'c', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('L', 0, 'e', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])

I'd expect the phonetic form /f/ in tough to correspond to either g or h, and the phonetic form /s/ in choice to correspond to c. However, that's not the case. Is this expected behavior or a bug?

Incorrect transliteration of Yoruba /j/

It seems that Yoruba, which uses ⟨y⟩ for the approximant /j/, is being incorrectly transcribed such that ⟨y⟩ becomes the vowel /y/.

>>> import epitran
>>> translator = epitran.Epitran('yor-Latn')
>>> translator.transliterate('Yorùbá') # expected output: 'jōrùbá'
'yorùbá'

KeyError when trying to transcribe any text

Traceback (most recent call last):
  File "d:/tcritp/tcript.py", line 3, in <module>
    tr.transliterate(u'spark')
  File "D:\Python3\lib\site-packages\epitran\_epitran.py", line 62, in transliterate
    return self.epi.transliterate(word, normpunc, ligatures)
  File "D:\Python3\lib\site-packages\epitran\flite.py", line 92, in transliterate
    acc.append(self.english_g2p(chunk))
  File "D:\Python3\lib\site-packages\epitran\flite.py", line 211, in english_g2p
    return self.arpa_to_ipa(arpa_text)
  File "D:\Python3\lib\site-packages\epitran\flite.py", line 76, in arpa_to_ipa
    text = ''.join(ipa_list)
  File "D:\Python3\lib\site-packages\epitran\flite.py", line 75, in <lambda>
    ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
KeyError: ''

Source:

from epitran import Epitran
tr = Epitran('eng-Latn', cedict_file='cedict_1_0_ts_utf-8_mdbg.txt')
tr.transliterate('test')

Note: Changing 'test' to u'test' does not help.

IndexError: list index out of range

When I am running this code:

import epitran
epi = epitran.Epitran('eng-Latn')
print (epi.transliterate(u'Berkeley'))

I am using Python 3. Would you please help me fix this error?

File "/home/hamada/.local/lib/python3.6/site-packages/epitran/flite.py", line 212, in english_g2p
arpa_text = arpa_text.splitlines()[0]
IndexError: list index out of range

Can't pip install on Python 3

On Debian stretch:

$ pip3 install epitran
Collecting epitran
  Using cached epitran-0.23-py2.py3-none-any.whl
Collecting marisa-trie (from epitran)
  Using cached marisa_trie-0.7.4-cp35-cp35m-manylinux1_x86_64.whl
Collecting panphon>=0.12 (from epitran)
  Using cached panphon-0.12-py2.py3-none-any.whl
Collecting unicodecsv (from epitran)
Collecting subprocess32 (from epitran)
  Using cached subprocess32-3.2.7.tar.gz
    Complete output from command python setup.py egg_info:
    This backport is for Python 2.x only.
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-7hr0xfki/subprocess32/

How to get offset mapping?

For example I have epi.transliterate('янъ') -> jan
With word_to_tuples I can find the skipped 'ъ', but how can I know that 'я' occupies the first two indices of the output?
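One possible approach, assuming you can recover (grapheme, IPA) pairs from the word_to_tuples output (its orthographic and phonetic fields), is to accumulate offsets into the IPA string. ipa_offsets below is a hypothetical helper, not part of epitran:

```python
def ipa_offsets(pairs):
    """For (grapheme, ipa) pairs such as [('я', 'ja'), ('н', 'n'), ('ъ', '')],
    return (grapheme, start, end) spans into the concatenated IPA output.
    Multi-character graphemes and deletions fall out naturally: a deleted
    grapheme gets an empty span (start == end)."""
    spans, pos = [], 0
    for orth, ipa in pairs:
        spans.append((orth, pos, pos + len(ipa)))
        pos += len(ipa)
    return spans

ipa_offsets([('я', 'ja'), ('н', 'n'), ('ъ', '')])
# → [('я', 0, 2), ('н', 2, 3), ('ъ', 3, 3)]
```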

ERROR! Related to MICROSOFT VISUAL STUDIO C++

Hello, can someone help me with this error? I've already updated my Microsoft Visual Studio because that was the first error; now I am getting this one. Why is this happening? Thank you!

C:\Users\LENOVO>pip install epitran
Collecting epitran
  Using cached epitran-1.8-py2.py3-none-any.whl (132 kB)
Requirement already satisfied: regex in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from epitran) (2020.7.14)
Requirement already satisfied: unicodecsv in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from epitran) (0.14.1)
Requirement already satisfied: setuptools in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from epitran) (50.3.1)
Collecting marisa-trie
  Using cached marisa-trie-0.7.5.tar.gz (270 kB)
Collecting panphon>=0.16
  Using cached panphon-0.17-py2.py3-none-any.whl (71 kB)
Requirement already satisfied: PyYAML in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from panphon>=0.16->epitran) (5.3)
Collecting editdistance
  Using cached editdistance-0.5.3.tar.gz (27 kB)
Requirement already satisfied: numpy in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from panphon>=0.16->epitran) (1.18.0)
Requirement already satisfied: munkres in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from panphon>=0.16->epitran) (1.1.4)
Building wheels for collected packages: marisa-trie, editdistance
  Building wheel for marisa-trie (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\LENOVO\AppData\Local\Temp\pip-wheel-51uiu7_u'
       cwd: C:\Users\LENOVO\AppData\Local\Temp\pip-install-n07jydhk\marisa-trie\
  Complete output (23 lines):
  running bdist_wheel
  running build
  running build_clib
  building 'libmarisa-trie' library
  creating build
  creating build\temp.win-amd64-3.8
  creating build\temp.win-amd64-3.8\marisa-trie
  creating build\temp.win-amd64-3.8\marisa-trie\lib
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\io
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\trie
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\vector
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\agent.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\agent.obj
  agent.cc
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\keyset.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\keyset.obj
  keyset.cc
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\trie.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\trie.obj
  trie.cc
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa/grimoire/io\mapper.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa/grimoire/io\mapper.obj
  mapper.cc
  marisa-trie\lib\marisa/grimoire/io\mapper.cc(4): fatal error C1083: Cannot open include file: 'windows.h': No such file or directory
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
  ----------------------------------------
  ERROR: Failed building wheel for marisa-trie
  Running setup.py clean for marisa-trie
  Building wheel for editdistance (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\editdistance\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\editdistance\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\LENOVO\AppData\Local\Temp\pip-wheel-cibnnpm3'
       cwd: C:\Users\LENOVO\AppData\Local\Temp\pip-install-n07jydhk\editdistance\
  Complete output (30 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.8
  creating build\lib.win-amd64-3.8\editdistance
  copying editdistance\__init__.py -> build\lib.win-amd64-3.8\editdistance
  copying editdistance\_editdistance.h -> build\lib.win-amd64-3.8\editdistance
  copying editdistance\def.h -> build\lib.win-amd64-3.8\editdistance
  running build_ext
  building 'editdistance.bycython' extension
  creating build\temp.win-amd64-3.8
  creating build\temp.win-amd64-3.8\Release
  creating build\temp.win-amd64-3.8\Release\editdistance
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -I./editdistance -Ic:\users\lenovo\appdata\local\programs\python\python38\include -Ic:\users\lenovo\appdata\local\programs\python\python38\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpeditdistance/_editdistance.cpp /Fobuild\temp.win-amd64-3.8\Release\editdistance/_editdistance.obj
  _editdistance.cpp
  editdistance/_editdistance.cpp(91): warning C4267: 'initializing': conversion from 'size_t' to 'unsigned int', possible loss of data
  editdistance/_editdistance.cpp(119): note: see reference to function template instantiation 'unsigned int edit_distance_map_<1>(const int64_t *,const size_t,const int64_t *,const size_t)' being compiled
  editdistance/_editdistance.cpp(92): warning C4267: 'initializing': conversion from 'size_t' to 'unsigned int', possible loss of data
  editdistance/_editdistance.cpp(44): warning C4018: '<=': signed/unsigned mismatch
  editdistance/_editdistance.cpp(97): note: see reference to function template instantiation 'unsigned int edit_distance_bpv<cmap_v,varr<1>>(T &,const int64_t *,const size_t &,const unsigned int &,const unsigned int &)' being compiled
          with
          [
              T=cmap_v
          ]
  editdistance/_editdistance.cpp(119): note: see reference to function template instantiation 'unsigned int edit_distance_map_<1>(const int64_t *,const size_t,const int64_t *,const size_t)' being compiled
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -I./editdistance -Ic:\users\lenovo\appdata\local\programs\python\python38\include -Ic:\users\lenovo\appdata\local\programs\python\python38\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpeditdistance/bycython.cpp /Fobuild\temp.win-amd64-3.8\Release\editdistance/bycython.obj
  bycython.cpp
  c:\users\lenovo\appdata\local\programs\python\python38\include\pyconfig.h(206): fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
  ----------------------------------------
  ERROR: Failed building wheel for editdistance
  Running setup.py clean for editdistance
Failed to build marisa-trie editdistance
Installing collected packages: marisa-trie, editdistance, panphon, epitran
    Running setup.py install for marisa-trie ... error
    ERROR: Command errored out with exit status 1:
     command: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\LENOVO\AppData\Local\Temp\pip-record-teibnt8o\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\lenovo\appdata\local\programs\python\python38\Include\marisa-trie'
         cwd: C:\Users\LENOVO\AppData\Local\Temp\pip-install-n07jydhk\marisa-trie\
    Complete output (23 lines):
    running install
    running build
    running build_clib
    building 'libmarisa-trie' library
    creating build
    creating build\temp.win-amd64-3.8
    creating build\temp.win-amd64-3.8\marisa-trie
    creating build\temp.win-amd64-3.8\marisa-trie\lib
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\io
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\trie
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\vector
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\agent.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\agent.obj
    agent.cc
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\keyset.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\keyset.obj
    keyset.cc
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\trie.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\trie.obj
    trie.cc
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa/grimoire/io\mapper.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa/grimoire/io\mapper.obj
    mapper.cc
    marisa-trie\lib\marisa/grimoire/io\mapper.cc(4): fatal error C1083: Cannot open include file: 'windows.h': No such file or directory
    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\LENOVO\AppData\Local\Temp\pip-record-teibnt8o\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\lenovo\appdata\local\programs\python\python38\Include\marisa-trie' Check the logs for full command output.

C:\Users\LENOVO>

flite and t2p are provided in Ubuntu packages

You can install flite with sudo apt install flite; t2p is included.

https://packages.ubuntu.com/search?keywords=flite&searchon=names&exact=1&suite=all&section=all

$ dpkg -S /usr/bin/t2p
flite: /usr/bin/t2p
$ flite --version
  Carnegie Mellon University, Copyright (c) 1999-2016, all rights reserved
  version: flite-2.1-release Dec 2017 (http://cmuflite.org)
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

lex_lookup must still be built from source.

Citing Epitran

Hi!

I used this library for some work that I am writing a paper on. Is there something that I can cite? I should note that I also used Panphon and cited appropriately from the paper linked in that README.

Problem with transliterating English contractions

How does Epitran transliterate contractions? It seems that the package has difficulties with them. For example:

  • everyone's is transliterated as /ɛvɹiownz/ instead of /ɛvɹiwonz/ (o and w are wrongly reversed)
  • aren't is transliterated as /ɑɹənt/ instead of /ɑɹnt/. (a vowel is wrongly inserted)

Simply concatenating the contractions seems to give better results in some cases. Why is that?
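As a stopgap, one might apply exactly that concatenation as a preprocessing step before calling transliterate. This is only a sketch; the set of apostrophe variants handled is an assumption:

```python
def strip_apostrophes(text):
    """Join contractions before transliteration by removing ASCII (')
    and typographic (’) apostrophes, e.g. "aren't" -> "arent"."""
    return text.replace("'", "").replace("\u2019", "")
```

Usage would then be something like epi.transliterate(strip_apostrophes("aren't")).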

errors for Italian

epi = epitran.Epitran("ita-Latn") 
epi.transliterate("motorizzazione") 

returns 'motorit͡sasione', but it should be, at least, 'motorit͡sat͡sione' or, better, 'motorit͡st͡sat͡st͡sione' (the semivowel "i" should also be "j", but I do not know how fine-grained the transliteration is supposed to be)

Bad Cyrillic symbols sometimes

So I use this code:

from epitran.backoff import Backoff

backoff = Backoff(['fas-Arab', 'rus-Cyrl'])
                  
backoff.transliterate('Привет дорогой друг пидор')

and it gives

'prʲivʲet doroɡoй druɡ pʲidor'

As you can see, there is the Russian й in the result, which should (maybe) be j. Or am I wrong?
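Until the cause is fixed, a post-processing map over the output is one possible workaround (a sketch; the й → j correspondence is precisely the assumption in question):

```python
# Hypothetical cleanup for Cyrillic characters that survive into the output.
POSTFIX = {'й': 'j'}

def fix_output(ipa):
    """Replace any leftover mapped characters, leaving everything else as-is."""
    return ''.join(POSTFIX.get(ch, ch) for ch in ipa)

fix_output('doroɡoй')  # → 'doroɡoj'
```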

Regex + rules difference between [] and ().

In the fra-Latn.txt preprocessor there are some rules that use [] and others that use ():

::vowel:: = a|á|â|æ|e|é|è|ê|ë|i|î|ï|o|ô|œ|u|ù|û|ü|A|Á|Â|Æ|E|É|È|Ê|Ë|I|Î|Ï|O|Ô|Œ|U|Ù|Û|Ü|ɛ
::front_vowel:: = e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ
::consonant:: = b|ç|c|ch|d|f|g|j|k|l|m|n|p|r|s|t|v|w|z|ʒ
% Treatment of <c> and <s>
sc -> s / _ [::front_vowel::]
c -> s / _ [::front_vowel::]

% High vowels become glides before vowels
ou -> w / _ (::vowel::)
u -> ɥ / _ (::vowel::)

Is there a difference in behaviour between the two?

Am I right in thinking that:

  • [::front_vowel::] is [e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ] in regex.
  • (::front_vowel::) is (e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ) in regex.

From what I understand of regex they look as though they'd do the same thing, except [::front_vowel::] would also match the | char.

I also don't think [] would work if there are two or more chars in a group, for example ch in:

::consonant:: = b|ç|c|ch|d|f|g|j|k|l|m|n|p|r|s|t|v|w|z|ʒ

I'd guess that () also creates capturing groups, but I'm not sure whether that's being utilised.

Any guidance would be greatly appreciated.
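For what it's worth, if these rules compile more or less literally into Python regular expressions (an assumption about epitran's rule compiler), the difference is directly observable:

```python
import re

# [a|b] is a character class: it matches 'a', 'b', or the literal '|',
# and always exactly one character. (a|b) is an alternation, which also
# supports multi-character alternatives like 'ch'.
print(bool(re.fullmatch(r'[e|i]', '|')))      # True: the class matches '|'
print(bool(re.fullmatch(r'(e|i)', '|')))      # False: alternation does not
print(bool(re.fullmatch(r'(c|ch)d', 'chd')))  # True: 'ch' matched as a unit
print(bool(re.fullmatch(r'[c|ch]d', 'chd')))  # False: class is one char only
```

So the intuition in the question looks right on both counts: [] leaks the | character and cannot represent multi-character graphemes like ch.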

Different IPA with punctuation in German

Hello all,

First off I wanted to say well done on Epitran! It is a tool that has proven useful for many projects of mine.

I stumbled across something today and I wanted to know if Epitran was designed to do this, or if it's a bug. I noticed that words in German get different IPA transliterations when punctuation is added. As far as I can tell, this doesn't happen in any other language (I tried to reproduce the error in Polish, Russian, and English).

Examples:

  • transliterate('heute') -> 'hoytə' but transliterate('heute?') -> 'hoyhte?'
  • transliterate('Tag') -> 'tak' but transliterate('Tag!') -> 'taɡ!'
  • transliterate('Ende') -> 'əndə' but transliterate('Ende.') -> 'ənde.'

For the last two examples, I could accept the transliterations produced when punctuation is added, once phonetic environment and dialect are taken into consideration. However, I don't know of any case where 'heute' should have an 'h' after the diphthong in its transliteration.

When using the transliterate function, I normally use normpunc=True and ligatures=True, but even disabling those flags produces the same results. I also used pip to check that I was using the latest version of Epitran.

I would really appreciate some info on this matter, as it will guide my future projects. Thanks a lot for your time!
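As a workaround while this is investigated, one could split trailing punctuation off before transliterating and re-attach it afterwards. This is a sketch: transliterate_clean is hypothetical and takes the transliteration function as an argument.

```python
import string

def transliterate_clean(translit, text):
    """Strip trailing ASCII punctuation, transliterate the core word,
    then re-attach the punctuation unchanged, so 'heute?' is processed
    as 'heute' + '?'."""
    core = text.rstrip(string.punctuation)
    tail = text[len(core):]
    return translit(core) + tail

# Toy demonstration with str.upper standing in for epi.transliterate:
transliterate_clean(str.upper, 'heute?')  # → 'HEUTE?'
```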

In French, a final 's' should be silent, but a final 'es' after a consonant should not be

Hello, I just discovered this awesome module, and I found two issues with the French language (both with fra-Latn and fra-Latn-np).

Final 's'

When a word ends with a 's', the 's' is silent. So "il" ("he / she") is pronounced in the same way as "ils" ("they"). However, when I try epi.transliterate("il") and epi.transliterate("ils"), it returns il and ils.

Final 'es'

The final 'es' is pronounced when it comes after a consonant. For example, "faites" ("do") is pronounced "fɛt" and "fait" ("done") is pronounced "fɛ". But transliterate() returns "fe" and "fe".

In the same way, it returns "ɡaraʒ" for "garage" (which is correct, Wikitionary gives "ɡa.ʁaʒ") but "ɡara" for "garages".
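As a stopgap while the rule files are unfixed, one could post-process epitran's output with an orthography-aware heuristic. This sketch is deliberately blunt (it only handles the silent plural '-s'), and the function name is mine, not epitran's:

```python
def french_final_s_fix(orth, ipa):
    """If the orthographic word ends in 's' and the IPA still ends in
    /s/, drop it: the French plural '-s' is silent.  This does NOT
    handle '-es' after a consonant (faites -> /fɛt/), which needs a
    real rule in the fra-Latn map files."""
    if orth.endswith('s') and ipa.endswith('s'):
        return ipa[:-1]
    return ipa
```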

Duplicated entries in ipa-xsampa.csv

I downloaded ipa-xsampa.csv and found some errors in the data, e.g.

  • R\ in X.SAMPA maps to vd uvular fricative and vl uvular trill
  • glottal plosive has two identical rows

I modified them based on Wikipedia. You may like to check the modified file: ipa-xsampa-modified.csv.txt. Note that I modified the file according to my own requirements, so it might not suit your needs.

Thanks for the data!
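A small script can flag such duplicates before they reach the mapping table; this is a standalone helper, not part of epitran, and the key column index is an assumption you may need to adjust for the actual file layout:

```python
import csv
from collections import Counter

def duplicate_keys(csv_lines, key_col=0):
    """Return the keys that occur in more than one row of a mapping
    table such as ipa-xsampa.csv (pass any iterable of CSV lines)."""
    keys = [row[key_col] for row in csv.reader(csv_lines) if row]
    return [k for k, n in Counter(keys).items() if n > 1]

# usage:
#   with open('ipa-xsampa.csv', encoding='utf-8') as f:
#       print(duplicate_keys(f))
```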

lex_lookup (from flite) is not installed

How to solve this problem?

import epitran

epi = epitran.Epitran('eng-Latn')

epi.transliterate('Hello')
WARNING:root:lex_lookup (from flite) is not installed.
Traceback (most recent call last):

  File "<ipython-input-3-9e6f98d7c4c9>", line 1, in <module>
    epi.transliterate('Hello')

  File "C:\ProgramData\Anaconda3\lib\site-packages\epitran\_epitran.py", line 62, in transliterate
    return self.epi.transliterate(word, normpunc, ligatures)

  File "C:\ProgramData\Anaconda3\lib\site-packages\epitran\flite.py", line 94, in transliterate
    acc.append(self.english_g2p(chunk))

  File "C:\ProgramData\Anaconda3\lib\site-packages\epitran\flite.py", line 212, in english_g2p
    arpa_text = arpa_text.splitlines()[0]

IndexError: list index out of range

It's working for other languages but not English.
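Since epitran simply shells out to the lex_lookup binary, a quick sanity check is to ask Python whether that binary is on the PATH the interpreter actually inherits (a hypothetical diagnostic, not part of epitran):

```python
import shutil

def binary_available(name):
    """Return True if `name` resolves to an executable on this
    process's PATH -- the same lookup a subprocess call would use."""
    return shutil.which(name) is not None

# If this prints False, epitran's flite backend cannot work, whatever
# the flite build directory reported.
print(binary_available('lex_lookup'))
```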

help please: couldn't get the amh-Ethi working

Here is the traceback for Epitran("amh-Ethi"); for other languages it works fine.

import epitran
epi = epitran.Epitran("amh-Ethi")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "epitran/_epitran.py", line 42, in __init__
    self.epi = SimpleEpitran(code, preproc, postproc, ligatures)
  File "epitran/simple.py", line 52, in __init__
    self.postprocessor = PrePostProcessor(code, 'post')
  File "epitran/ppprocessor.py", line 28, in __init__
    self.rules = self._read_rules(code, fix)
  File "epitran/ppprocessor.py", line 38, in _read_rules
    return Rules([abs_fn])
  File "epitran/rules.py", line 28, in __init__
    rules = self._read_rule_file(rule_file)
  File "epitran/rules.py", line 36, in _read_rule_file
    rules.append(self._read_rule(line))
  File "epitran/rules.py", line 65, in _read_rule
    return self._fields_to_function(a, b, X, Y)
  File "epitran/rules.py", line 81, in _fields_to_function
    regexp = re.compile(left)
  File "/home/anaconda2/lib/python2.7/site-packages/regex.py", line 345, in compile
    return _compile(pattern, flags, kwargs)
  File "/home/anaconda2/lib/python2.7/site-packages/regex.py", line 490, in _compile
    caught_exception.pos)
_regex_core.error: missing ) at position 53
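The traceback says some left-hand side in the amh-Ethi post-processor rules compiles to an invalid regex. A small helper can locate the offending line before filing a fix; this is my own sketch using the stdlib re module (epitran itself uses the third-party regex module, so error positions may differ slightly):

```python
import re

def find_uncompilable(patterns):
    """Try to compile each candidate pattern; return (line_no, message)
    pairs for those that fail, e.g. due to an unbalanced parenthesis."""
    bad = []
    for lineno, pat in enumerate(patterns, start=1):
        try:
            re.compile(pat)
        except re.error as exc:
            bad.append((lineno, str(exc)))
    return bad

# e.g. run this over the left-hand fields extracted from the
# amh-Ethi post-processor rule file
```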

No rule to make target 'lex_lookup'

Hello

I am trying to install lex_lookup, since I wish to convert an English text to API.
I am running Cygwin on Windows 10. I followed the instructions, including changing "cp -pd" to "cp -pR" in the relevant flite-2.0.5-current\main\Makefile. However, I cannot manage to run the command "make lex_lookup".

Thank you very much for your help.

Unable to run word_to_tuples for English

I am facing an issue running the model for English. I have installed Flite and am able to run c = os.system(command) from my python script as well.
I get the following warning:

WARNING:root:lex_lookup (from flite) is not installed.

Did anyone else face this issue? Could you let me know how you have solved it? Thanks!

Support for other English varieties

Hello,

If I understand correctly, if you use Flite as the backend for English G2P, you get transcriptions in US English. How would one go about getting transcriptions for other varieties of English, e.g. Received Pronunciation or Australian English? I know that Festvox supports British and Scottish English, so could it in theory be used as the backend for English G2P?

For my use case, it's not super important that the vowels are precise, but the rhoticity distinction would be extremely useful.

Thanks!

Bengali script sometimes leaves Bengali characters in transcriptions

IPA transliterations of Bengali characters with Chandrabindus in them leave the Chandrabindu there, when it should be replaced with a combining tilde, the corresponding IPA character. With epitran 0.56 installed:

>>> import epitran
>>> translator = epitran.Epitran('ben-Beng')
>>> translator.transliterate('হাঁ')
ɦaঁ

I haven't checked extensively, but it is possible this also occurs with other languages and diacritics.
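Until the ben-Beng map handles U+0981 itself, a user-side post-fix is straightforward (the function name is mine; the codepoints are the Unicode Chandrabindu and the IPA combining tilde):

```python
CHANDRABINDU = '\u0981'      # BENGALI SIGN CANDRABINDU
COMBINING_TILDE = '\u0303'   # IPA nasalization diacritic

def fix_chandrabindu(ipa):
    """Replace a Chandrabindu that survived transliteration with the
    combining tilde it should map to."""
    return ipa.replace(CHANDRABINDU, COMBINING_TILDE)

# fix_chandrabindu('ɦa\u0981') returns 'ɦa' + U+0303, i.e. nasalized /a/
```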

Usage of BCP 47 tags with ASR systems

I am curious to get a sense of what other researchers feel about the use of BCP 47 tags for speech recognition models, and at what level (data IDs, training data, the model itself, or output from the model). Read more about BCP 47 here: https://www.w3.org/International/articles/language-tags/
https://tools.ietf.org/html/bcp47

Some months ago I was on the IETF mailing list for sub-tags and suggested that speech-to-text and text-to-speech models should have tags identifying them, but there didn't seem to be any great "aha!" moments from that crowd.

question - Method for Arpabet conversion?

First of all, thanks for making this great software. It works perfectly for me.
Adding rules is also explained very clearly, and I could implement it with ease.

I am parsing and converting a Dutch wordlist to IPA and X-SAMPA, trying to generate a dictionary for building voices. I saw there's an ARPAbet mapping too, which would be handy for training Sphinx. Should I create a class and an ipa2arpa.csv, like you did for the X-SAMPA conversion?

I am now using xsampa like this:

import epitran
from epitran.xsampa import XSampa

# set to Dutch
epi = epitran.Epitran('nld-Latn')

# X-SAMPA class
xs = XSampa()

s = epi.transliterate(word)  # Python 3 strings are already Unicode
s_a = xs.ipa2xs(s)
So should I make a class like XSampa for ipa2arpa, or is there a simpler way?
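Following the XSampa pattern, a minimal sketch of such a class might look like this. Note that ipa2arpa.csv is a file you would author yourself (epitran does not ship one), and the greedy longest-match scan is my assumption about how to handle multi-character IPA symbols such as affricates:

```python
import csv

class IPA2Arpa:
    """Sketch of an IPA-to-ARPAbet converter mirroring the XSampa class."""

    def __init__(self, csv_path='ipa2arpa.csv'):
        # expects two columns per row: ipa,arpa
        with open(csv_path, newline='', encoding='utf-8') as f:
            self.table = {ipa: arpa for ipa, arpa in csv.reader(f)}
        # try longer IPA sequences first so affricates match whole
        self.keys = sorted(self.table, key=len, reverse=True)

    def ipa2arpa(self, text):
        out, i = [], 0
        while i < len(text):
            for key in self.keys:
                if text.startswith(key, i):
                    out.append(self.table[key])
                    i += len(key)
                    break
            else:
                i += 1  # skip characters with no mapping
        return ' '.join(out)
```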

KeyError

When will this problem be fixed? Thank you very much!

The same word is transliterated differently

I got a strange transliteration for Italian:

abiud d͡ʒenerɔ eliat͡ʃim eliat͡ʃim ɡenerɔ asor

ɡenerɔ should be like d͡ʒenerɔ. This happens if the string is part of a much larger string, but not when it is transliterated in isolation (i.e., only that string).
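Until the cause is found, one workaround (my own sketch, which assumes whitespace tokenization is acceptable for your text) is to transliterate word by word, so no cross-word context can leak into the output:

```python
def transliterate_per_word(epi, text):
    """Transliterate each whitespace-separated token in isolation so
    the output for a word cannot depend on its neighbours."""
    return ' '.join(epi.transliterate(w) for w in text.split())
```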

keyerror on any english transliteration

I wanted to give this a whirl but hit a speed bump from the get-go:

In [2]: epi = epitran.Epitran('eng-Latn')
In [3]: epi.transliterate('iceland')
WARNING:root:lex_lookup (from flite) is not installed.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-54aaf7e8072d> in <module>()
----> 1 epi.transliterate('iceland')

/home/jeremy/.local/lib/python3.6/site-packages/epitran/_epitran.py in transliterate(self, word, normpunc, ligatures)
     60             unicode: IPA string
     61         """
---> 62         return self.epi.transliterate(word, normpunc, ligatures)
     63 
     64     def reverse_transliterate(self, ipa):

/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in transliterate(self, text, normpunc, ligatures)
     89         for chunk in self.chunk_re.findall(text):
     90             if self.letter_re.match(chunk):
---> 91                 acc.append(self.english_g2p(chunk))
     92             else:
     93                 acc.append(chunk)

/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in english_g2p(self, text)
    205             logging.warning('Non-zero exit status from lex_lookup.')
    206             arpa_text = ''
--> 207         return self.arpa_to_ipa(arpa_text)

/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in arpa_to_ipa(self, arpa_text, ligatures)
     73         arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
     74         ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
---> 75         text = ''.join(ipa_list)
     76         return text
     77 

/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in <lambda>(d)
     72         arpa_list = self.arpa_text_to_list(arpa_text)
     73         arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
---> 74         ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
     75         text = ''.join(ipa_list)
     76         return text

KeyError: ''

In [4]: epi2 = epitran.Epitran('rus-Cyrl')

In [7]: epi.transliterate('')
Out[7]: ''

In [9]: epi2.transliterate('Приве́т')
Out[9]: 'prʲivʲét'
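The KeyError comes from mapping an empty ARPAbet token (produced when lex_lookup is absent) through arpa_map. A defensive variant of that mapping step, sketched below as a possible patch rather than epitran's actual code, would skip unknown segments instead of raising:

```python
def arpa_to_ipa_safe(arpa_map, arpa_segments):
    """Join the IPA values for known ARPAbet segments, silently
    skipping anything the map does not contain (such as the empty
    string emitted when lex_lookup is missing)."""
    return ''.join(arpa_map[s] for s in arpa_segments if s in arpa_map)
```

The real fix, of course, is installing lex_lookup so the segments are never empty in the first place.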

English IPA translation

I am trying to use Epitran to create IPA conversions for English sentences, and it doesn't produce results I expect for some common words.

import epitran
epi = epitran.Epitran('eng-Latn')
epi.transliterate("was does buzz")
'wɑz dowz bʌz'

Note that IPA for does contains a w. Looking through dictionaries, I find ˈdəz, dɪz. When all three are put into a simple IPA reader, the dictionary versions sound correct and epitran's translation sounds wrong.
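For a handful of high-frequency words, one pragmatic option is a hand-curated override table consulted before epitran. All names below are mine, and the IPA value is the dictionary form quoted above, not epitran output:

```python
# hand-curated exceptions for words where flite's pronunciation
# disagrees with standard dictionaries
OVERRIDES = {'does': 'dəz'}

def transliterate_with_overrides(epi, word):
    """Check the exception list first, then fall back to epitran."""
    return OVERRIDES.get(word.lower()) or epi.transliterate(word)
```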

When I use epitran with eng-Latn, it says lex_lookup (from flite) is not installed, but I have installed lex_lookup

When I use epitran with eng-Latn, it tells me that lex_lookup (from flite) is not installed, but I have installed lex_lookup.

(base) [root@host-10-29-0-161 testsuite]# make lex_lookup
Makefile:83: warning: overriding recipe for target `multi_thread'
Makefile:80: warning: ignoring old recipe for target `multi_thread'
make: `lex_lookup' is up to date.

Error:

import epitran
epi = epitran.Epitran('eng-Latn')
print(epi.transliterate(u'Berkeley'))
WARNING:root:lex_lookup (from flite) is not installed.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/huawei/data1/z00574176/G2P/git_reproduced/epitran-master/epitran/_epitran.py", line 62, in transliterate
    return self.epi.transliterate(word, normpunc, ligatures)
  File "/opt/huawei/data1/z00574176/G2P/git_reproduced/epitran-master/epitran/flite.py", line 96, in transliterate
    acc.append(self.english_g2p(chunk))
  File "/opt/huawei/data1/z00574176/G2P/git_reproduced/epitran-master/epitran/flite.py", line 214, in english_g2p
    arpa_text = arpa_text.splitlines()[0]
IndexError: list index out of range
