
epitran's Introduction

Epitran

A library and tool for transliterating orthographic text as IPA (International Phonetic Alphabet).

Usage

The Python modules epitran and epitran.vector can be used to easily write more sophisticated Python programs for deploying the Epitran mapping tables, preprocessors, and postprocessors. This is documented below.

If you wish to use Epitran to convert English to IPA, you must install Flite (including lex_lookup) as detailed below.

Using the epitran Module

The Epitran class

The most general functionality in the epitran module is encapsulated in the very simple Epitran class:

Epitran(code, preproc=True, postproc=True, ligatures=False, cedict_file=None).

Its constructor takes one required argument, code: the ISO 639-3 code of the language to be transliterated, followed by a hyphen and a four-letter code for the script (e.g. 'Latn' for Latin script, 'Cyrl' for Cyrillic script, and 'Arab' for a Perso-Arabic script). It also takes optional keyword arguments:

  • preproc and postproc enable pre- and post-processors. These are enabled by default.
  • ligatures enables non-standard IPA ligatures like "ʤ" and "ʨ".
  • cedict_file gives the path to the CC-CEDict dictionary file (relevant only when working with Mandarin Chinese and which, because of licensing restrictions, cannot be distributed with Epitran).
  • tones allows IPA tones (˩˨˧˦˥) to be included and is needed for tonal languages like Vietnamese and Hokkien. By default, this option is False and IPA tones are removed from the transcription.
  • For more options, type help(epitran.Epitran.__init__) into a Python terminal session.
>>> import epitran
>>> epi = epitran.Epitran('uig-Arab')  # Uyghur in Perso-Arabic script
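For example, to retain tone letters when working with a tonal language (a minimal sketch; vie-Latn is among the supported pairs listed below, and the exact output depends on the installed mapping tables):

>>> import epitran
>>> epi = epitran.Epitran('vie-Latn', tones=True)  # keep IPA tone letters (˩˨˧˦˥) in transcriptions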

It is now possible to use the Epitran class for English and Mandarin Chinese (Simplified and Traditional) G2P as well as for the other languages that use Epitran's "classic" model. For Chinese, it is necessary to point the constructor to a copy of the CC-CEDict dictionary:

>>> import epitran
>>> epi = epitran.Epitran('cmn-Hans', cedict_file='cedict_1_0_ts_utf-8_mdbg.txt')

The most useful public method of the Epitran class is transliterate:

Epitran.transliterate(text, normpunc=False, ligatures=False). Convert text (in Unicode-encoded orthography of the language specified in the constructor) to IPA, which is returned. normpunc enables punctuation normalization and ligatures enables non-standard IPA ligatures like "ʤ" and "ʨ". Usage is illustrated below (Python 2):

>>> epi.transliterate(u'Düğün')
u'dy\u0270yn'
>>> print(epi.transliterate(u'Düğün'))
dyɰyn
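Under Python 3, where all strings are Unicode, the u prefix is unnecessary and the result prints directly:

>>> import epitran
>>> epi = epitran.Epitran('tur-Latn')
>>> epi.transliterate('Düğün')
'dyɰyn'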

Epitran.word_to_tuples(word, normpunc=False): Takes a word (a Unicode string) in a supported orthography as input and returns a list of tuples with each tuple corresponding to an IPA segment of the word. The tuples have the following structure:

(
    character_category :: String,
    is_upper :: Integer,
    orthographic_form :: Unicode String,
    phonetic_form :: Unicode String,
    segments :: List<Tuples>
)

Note that word_to_tuples is not implemented for all language-script pairs.

The codes for character_category are the initial characters of the two-character sequences listed as "General Category" codes in Chapter 4 of the Unicode Standard. For example, "L" corresponds to letters and "P" corresponds to punctuation. The above data structure is likely to change in subsequent versions of the library. The structure of segments is as follows:

(
    segment :: Unicode String,
    vector :: List<Integer>
)

Here is an example of an interaction with word_to_tuples (Python 2):

>>> import epitran
>>> epi = epitran.Epitran('tur-Latn')
>>> epi.word_to_tuples(u'Düğün')
[(u'L', 1, u'D', u'd', [(u'd', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])]), (u'L', 0, u'u\u0308', u'y', [(u'y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), (u'L', 0, u'g\u0306', u'\u0270', [(u'\u0270', [-1, 1, -1, 1, 0, -1, -1, 0, 1, -1, -1, 0, -1, 0, -1, 1, -1, 0, -1, 1, -1])]), (u'L', 0, u'u\u0308', u'y', [(u'y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), (u'L', 0, u'n', u'n', [(u'n', [-1, 1, 1, -1, -1, -1, 1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])])]
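Because each tuple follows the structure given above, its fields can be unpacked directly. A minimal sketch that prints the orthography-to-IPA alignment recovered from the output above:

>>> for cat, upper, orth, phon, segs in epi.word_to_tuples(u'Düğün'):
...     print(u'{} -> {}'.format(orth, phon))
...
D -> d
ü -> y
ğ -> ɰ
ü -> y
n -> n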

The Backoff class

Sometimes, when parsing text in more than one script, it is useful to employ a graceful backoff. If one language mode does not work, it can be useful to fall back to another, and so on. This functionality is provided by the Backoff class:

Backoff(lang_script_codes, cedict_file=None)

Note that the Backoff class does not currently support parameterized preprocessor and postprocessor application and does not support non-standard ligatures. It also does not support punctuation normalization. lang_script_codes is a list of codes like eng-Latn or hin-Deva. For example, if one was transcribing a Hindi text with many English loanwords and some stray characters of Simplified Chinese, one might use the following code (Python 3):

>>> from epitran.backoff import Backoff
>>> backoff = Backoff(['hin-Deva', 'eng-Latn', 'cmn-Hans'], cedict_file='cedict_1_0_ts_utf-8_mdbg.txt')
>>> backoff.transliterate('हिन्दी')
'ɦindiː'
>>> backoff.transliterate('English')
'ɪŋɡlɪʃ'
>>> backoff.transliterate('中文')
'ʈ͡ʂoŋwən'

Backoff works on a token-by-token basis: tokens that contain mixed scripts will be returned as the empty string, since they cannot be fully converted by any of the modes.
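For instance, continuing the session above (a contrived token mixing Devanagari and Latin):

>>> backoff.transliterate('हिEnglish')
''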

The Backoff class has the following public methods:

  • transliterate: returns a unicode string of IPA phonemes
  • trans_list: returns a list of IPA unicode strings, each of which is a phoneme
  • xsampa_list: returns a list of X-SAMPA (ASCII) strings, each of which is a phoneme

Consider the following example (Python 3):

>>> backoff.transliterate('हिन्दी')
'ɦindiː'
>>> backoff.trans_list('हिन्दी')
['ɦ', 'i', 'n', 'd', 'iː']
>>> backoff.xsampa_list('हिन्दी')
['h\\', 'i', 'n', 'd', 'i:']

DictFirst

The DictFirst class provides a simple alternative to the Backoff class. It requires a dictionary of words known to be of Language A, one word per line in a UTF-8 encoded text file. It accepts three arguments: the language-script code for Language A, that for Language B, and a path to the dictionary file. It has one public method, transliterate, which works like Epitran.transliterate except that it returns the Language A transliteration if the input token is in the dictionary and the Language B transliteration otherwise:

>>> import dictfirst
>>> df = dictfirst.DictFirst('tpi-Latn', 'eng-Latn', '../sample-dict.txt')
>>> df.transliterate('pela')
'pela'
>>> df.transliterate('pelo')
'pɛlow'
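The dictionary file itself is plain UTF-8 text with one Language A word per line. A hypothetical sample-dict.txt consistent with the session above might begin:

pela
wanpela
tupela

Here, pela is in the dictionary and is transliterated as Tok Pisin, while pelo is not and falls through to the English transliteration.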

Preprocessors, postprocessors, and their pitfalls

In order to build a maintainable orthography-to-phoneme mapper, it is sometimes necessary to employ preprocessors that make contextual substitutions of symbols before text is passed to an orthography-to-IPA mapping system that preserves relationships between input and output characters. This is particularly true of languages with poor sound-symbol correspondence (like French and English). Languages like French are particularly good targets for this approach because the pronunciation of a given string of letters is highly predictable even though the individual symbols often do not map neatly onto sounds. (Sound-symbol correspondence is so poor in English that effective English G2P systems rely heavily on pronouncing dictionaries.)

Preprocessing the input words to allow for straightforward grapheme-to-phoneme mappings (as is done in the current version of epitran for some languages) is advantageous because the restricted regular expression language used to write the preprocessing rules is more powerful than the language for the mapping rules and allows the equivalent of many mapping rules to be written with a single rule. Without them, providing epitran support for languages like French and German would not be practical. However, they do present some problems. Specifically, when using a language with a preprocessor, one must be aware that the input word will not always be identical to the concatenation of the orthographic strings (orthographic_form) output by Epitran.word_to_tuples. Instead, the output of word_to_tuples will reflect the output of the preprocessor, which may delete, insert, and change letters in order to allow direct orthography-to-phoneme mapping at the next step. The same is true of other methods that rely on Epitran.word_to_tuples, such as VectorsWithIPASpace.word_to_segs from the epitran.vector module.
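A minimal sketch of this pitfall (assuming the chosen pair implements word_to_tuples; fra-Latn is used purely for illustration):

>>> epi = epitran.Epitran('fra-Latn')
>>> tuples = epi.word_to_tuples(u'beaux')
>>> u''.join(t[2] for t in tuples) == u'beaux'  # may be False: the preprocessor rewrites letters before mapping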

For information on writing new pre- and post-processors, see the section on "Extending Epitran with map files, preprocessors and postprocessors", below.

Using the epitran.vector Module

The epitran.vector module is also very simple. It contains one class, VectorsWithIPASpace, with one method of interest, word_to_segs:

The constructor for VectorsWithIPASpace takes two arguments:

  • code: the language-script code for the language to be processed.
  • spaces: the codes for the punctuation/symbol/IPA space in which the characters/segments from the data are expected to reside. The available spaces are listed below.

Its principal method is word_to_segs:

VectorsWithIPASpace.word_to_segs(word, normpunc=False). word is a Unicode string. If the keyword argument normpunc is set to True, punctuation discovered in word is normalized to ASCII equivalents.

A typical interaction with the VectorsWithIPASpace object via the word_to_segs method is illustrated here (Python 2):

>>> import epitran.vector
>>> vwis = epitran.vector.VectorsWithIPASpace('uzb-Latn', ['uzb-Latn'])
>>> vwis.word_to_segs(u'darë')
[(u'L', 0, u'd', u'd\u032a', u'40', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 0, -1]), (u'L', 0, u'a', u'a', u'37', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1]), (u'L', 0, u'r', u'r', u'54', [-1, 1, 1, 1, 0, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, 0, 0, 0, -1, 0, -1]), (u'L', 0, u'e\u0308', u'ja', u'46', [-1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, 0, -1, 1, -1, -1, -1, 0, -1]), (u'L', 0, u'e\u0308', u'ja', u'37', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1])]

(It is important to note that, though the word that serves as input--darë--has four letters, the output contains five tuples because the last letter in darë actually corresponds to two IPA segments, /j/ and /a/.) The returned data structure is a list of tuples, each with the following structure:

(
    character_category :: String,
    is_upper :: Integer,
    orthographic_form :: Unicode String,
    phonetic_form :: Unicode String,
    in_ipa_punc_space :: Integer,
    phonological_feature_vector :: List<Integer>
)

A few notes are in order regarding this data structure:

  • character_category is defined as part of the Unicode standard (Chapter 4). It consists of a single uppercase letter from the set {'L', 'M', 'N', 'P', 'S', 'Z', 'C'}. The most frequent of these are 'L' (letter), 'N' (number), 'P' (punctuation), and 'Z' (separator [including separating white space]).
  • is_upper consists only of integers from the set {0, 1}, with 0 indicating lowercase and 1 indicating uppercase.
  • The integer in in_ipa_punc_space is an index into a list of known characters/segments such that, barring degenerate cases, each character or segment is assigned a unique and globally consistent number. In cases where a character is encountered which is not in the known space, this field has the value -1.
  • The length of the list phonological_feature_vector should be constant for any instantiation of the class (it is based on the number of features defined in panphon) but is--in principle--variable. The integers in this list are drawn from the set {-1, 0, 1}, with -1 corresponding to '-', 0 corresponding to '0', and 1 corresponding to '+'. For characters with no IPA equivalent, all values in the list are 0.
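To render such a vector in the conventional +/0/- notation, the integers can be mapped back to symbols (a minimal sketch, where vec is a phonological_feature_vector taken from the output above):

>>> symbols = {-1: u'-', 0: u'0', 1: u'+'}
>>> u''.join(symbols[v] for v in vec)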

Language Support

Transliteration Language/Script Pairs

Code Language (Script)
aar-Latn Afar
aii-Syrc Assyrian Neo-Aramaic
amh-Ethi Amharic
amh-Ethi-pp Amharic (more phonetic)
amh-Ethi-red Amharic (reduced)
ara-Arab Literary Arabic
ava-Cyrl Avaric
aze-Cyrl Azerbaijani (Cyrillic)
aze-Latn Azerbaijani (Latin)
ben-Beng Bengali
ben-Beng-red Bengali (reduced)
bxk-Latn Bukusu
cat-Latn Catalan
ceb-Latn Cebuano
ces-Latn Czech
cjy-Latn Jin (Wiktionary)
cmn-Hans Mandarin (Simplified)*
cmn-Hant Mandarin (Traditional)*
cmn-Latn Mandarin (Pinyin)*
ckb-Arab Sorani
csb-Latn Kashubian
deu-Latn German
deu-Latn-np German†
deu-Latn-nar German (more phonetic)
eng-Latn English‡
epo-Latn Esperanto
fas-Arab Farsi (Perso-Arabic)
fra-Latn French
fra-Latn-np French†
fra-Latn-p French (more phonetic)
ful-Latn Fulah
gan-Latn Gan (Wiktionary)
got-Latn Gothic
hak-Latn Hakka (pha̍k-fa-sṳ)
hau-Latn Hausa
hin-Deva Hindi
hmn-Latn Hmong
hrv-Latn Croatian
hsn-Latn Xiang (Wiktionary)
hun-Latn Hungarian
ilo-Latn Ilocano
ind-Latn Indonesian
ita-Latn Italian
jam-Latn Jamaican
jav-Latn Javanese
kaz-Cyrl Kazakh (Cyrillic)
kaz-Cyrl-bab Kazakh (Cyrillic—Babel)
kaz-Latn Kazakh (Latin)
kbd-Cyrl Kabardian
khm-Khmr Khmer
kin-Latn Kinyarwanda
kir-Arab Kyrgyz (Perso-Arabic)
kir-Cyrl Kyrgyz (Cyrillic)
kir-Latn Kyrgyz (Latin)
kmr-Latn Kurmanji
kmr-Latn-red Kurmanji (reduced)
lao-Laoo Lao
lij-Latn Ligurian
lsm-Latn Saamia
ltc-Latn-bax Middle Chinese (Baxter and Sagart 2014)
mal-Mlym Malayalam
mar-Deva Marathi
mlt-Latn Maltese
mon-Cyrl-bab Mongolian (Cyrillic)
mri-Latn Maori
msa-Latn Malay
mya-Mymr Burmese
nan-Latn Hokkien (pe̍h-oē-jī)
nan-Latn-tl Hokkien (Tâi-lô)
nld-Latn Dutch
nya-Latn Chichewa
ood-Latn-alv Tohono O'odham
ood-Latn-sax Tohono O'odham
ori-Orya Odia
orm-Latn Oromo
pan-Guru Punjabi (Eastern)
pol-Latn Polish
por-Latn Portuguese
quy-Latn Ayacucho Quechua / Quechua Chanka
ron-Latn Romanian
run-Latn Rundi
rus-Cyrl Russian
sag-Latn Sango
sin-Sinh Sinhala
sna-Latn Shona
som-Latn Somali
spa-Latn Spanish
spa-Latn-eu Spanish (Iberian)
sqi-Latn Albanian
srp-Latn Serbian
swa-Latn Swahili
swa-Latn-red Swahili (reduced)
swe-Latn Swedish
tam-Taml Tamil
tam-Taml-red Tamil (reduced)
tel-Telu Telugu
tgk-Cyrl Tajik
tgl-Latn Tagalog
tgl-Latn-red Tagalog (reduced)
tha-Thai Thai
tir-Ethi Tigrinya
tir-Ethi-pp Tigrinya (more phonemic)
tir-Ethi-red Tigrinya (reduced)
tpi-Latn Tok Pisin
tuk-Cyrl Turkmen (Cyrillic)
tuk-Latn Turkmen (Latin)
tur-Latn Turkish (Latin)
tur-Latn-bab Turkish (Latin—Babel)
tur-Latn-red Turkish (reduced)
ukr-Cyrl Ukrainian
urd-Arab Urdu
uig-Arab Uyghur (Perso-Arabic)
uzb-Cyrl Uzbek (Cyrillic)
uzb-Latn Uzbek (Latin)
vie-Latn Vietnamese
wuu-Latn Shanghainese Wu (Wiktionary)
xho-Latn Xhosa
yor-Latn Yoruba
yue-Latn Cantonese
zha-Latn Zhuang
zul-Latn Zulu

*Chinese G2P requires the freely available CC-CEDict dictionary.

†These language preprocessors and maps naively assume a phonemic orthography.

‡English G2P requires the installation of the freely available CMU Flite speech synthesis system.

Languages with limited support due to highly ambiguous orthographies

Some of the languages listed above should be approached with caution. It is not possible to provide highly accurate support for these language-script pairs due to the high degree of ambiguity inherent in their orthographies. Eventually, we plan to support these languages with a different back end based on WFSTs or neural methods.

Code Language (Script)
ara-Arab Arabic
cat-Latn Catalan
ckb-Arab Sorani
fas-Arab Farsi (Perso-Arabic)
fra-Latn French
fra-Latn-np French†
mya-Mymr Burmese
por-Latn Portuguese

Language "Spaces"

Code Language Note
amh-Ethi Amharic
deu-Latn German
eng-Latn English
nld-Latn Dutch
spa-Latn Spanish
tur-Latn Turkish Based on data with suffixes attached
tur-Latn-nosuf Turkish Based on data with suffixes removed
uzb-Latn-suf Uzbek Based on data with suffixes attached

Note that major languages, including French, are missing from this table due to a lack of appropriate text data.

Installation of Flite (for English G2P)

For use with most languages, Epitran requires no special installation steps. It can be installed as an ordinary Python package, either with pip or by running python setup.py install in the root of the source directory. However, English G2P in Epitran relies on CMU Flite, a speech synthesis package by Alan Black and other speech researchers at Carnegie Mellon University. For the current version of Epitran, you should follow the installation instructions for lex_lookup, which is the default G2P interface for Epitran.

t2p

Not recommended. This interface to Flite is now deprecated; use lex_lookup instead.

lex_lookup

Recommended

t2p does not behave as expected on letter sequences that are highly infrequent in English. In such cases, t2p gives the pronunciation of the English letters of the name, rather than an attempt at the pronunciation of the name. There is a different binary included in the most recent (pre-release) versions of Flite that behaves better in this regard, but takes some extra effort to install. To install, you need to obtain at least version 2.0.5 of Flite. We recommend that you obtain the source from GitHub (https://github.com/festvox/flite). Untar and compile the source, following the steps below, adjusting where appropriate for your system:

$ tar xjf flite-2.0.5-current.tar.bz2
$ cd flite-2.0.5-current

or

$ git clone git@github.com:festvox/flite.git
$ cd flite/

then

$ ./configure && make
$ sudo make install
$ cd testsuite
$ make lex_lookup
$ sudo cp lex_lookup /usr/local/bin

When installing on macOS and other systems that use a BSD version of cp, some modification to a Makefile must be made in order to install flite-2.0.5 (between steps 3 and 4). Edit main/Makefile and change both instances of cp -pd to cp -pR, then resume the steps above at step 4.

Usage

To use lex_lookup, simply instantiate Epitran as usual, but with the code set to 'eng-Latn':

>>> import epitran
>>> epi = epitran.Epitran('eng-Latn')
>>> print(epi.transliterate(u'Berkeley'))
bɹ̩kli

Extending Epitran with map files, preprocessors and postprocessors

Language support in Epitran is provided through map files, which define mappings between orthographic and phonetic units, preprocessors that run before the map is applied, and postprocessors that run after the map is applied. Maps are defined in UTF-8 encoded, comma-separated value (CSV) files. Each file is named <iso639>-<iso15924>.csv, where <iso639> is the (three letter, all lowercase) ISO 639-3 code for the language and <iso15924> is the (four letter, capitalized) ISO 15924 code for the script. These files reside in the data directory of the Epitran installation under the map, pre, and post subdirectories, respectively. The pre- and post-processor files are text files whose format is described below. They follow the same naming conventions except that they have the file extension .txt.

Map files (mapping tables)

The map files are simple, two-column files where the first column contains the orthographic characters/sequences and the second column contains the phonetic characters/sequences. The two columns are separated by a comma; each row is terminated by a newline. For many languages (most languages with unambiguous, phonemically adequate orthographies) just this easy-to-produce mapping file is adequate to produce a serviceable G2P system.

The first row is a header and is discarded. For consistency, it should contain the fields "Orth" and "Phon". The following rows consist of fields of any length, separated by a comma. The same phonetic form (the second field) may occur any number of times, but an orthographic form may only occur once. Where one orthographic form is a prefix of another form, the longer form has priority in mapping. In other words, matching between orthographic units and orthographic strings is greedy. Mapping works by finding the longest prefix of the orthographic form and adding the corresponding phonetic string to the end of the phonetic form, then removing the prefix from the orthographic form and continuing, in the same manner, until the orthographic form is consumed. If no non-empty prefix of the orthographic form is present in the mapping table, the first character in the orthographic form is removed and appended to the phonetic form. The normal sequence then resumes. This means that non-phonetic characters may end up in the "phonetic" form, which we judge to be better than losing information through an inadequate mapping table.
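The procedure described above can be sketched in a few lines of Python (an illustrative re-implementation for clarity, not the code Epitran itself uses; the toy table stands in for a parsed map file):

def greedy_map(orth, table):
    # Sketch of the greedy longest-prefix mapping described above.
    phon = ''
    while orth:
        for i in range(len(orth), 0, -1):  # try the longest prefix first
            if orth[:i] in table:
                phon += table[orth[:i]]
                orth = orth[i:]
                break
        else:  # no prefix matched: pass the first character through
            phon += orth[0]
            orth = orth[1:]
    return phon

# Toy table corresponding to a map file with the rows "ch,t͡ʃ", "c,k", and "a,a":
print(greedy_map('cacha', {'ch': 't͡ʃ', 'c': 'k', 'a': 'a'}))  # prints kat͡ʃa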

Preprocessors and postprocessors

For language-script pairs with more complicated orthographies, it is sometimes necessary to manipulate the orthographic form prior to mapping or to manipulate the phonetic form after mapping. This is done, in Epitran, with grammars of context-sensitive string rewrite rules. In truth, these rules would be more than adequate to solve the mapping problem as well, but in practical terms it is usually easier to let easy-to-understand and easy-to-maintain mapping files carry most of the weight of conversion and to reserve the more powerful context-sensitive grammar formalism for pre- and post-processing.

The preprocessor and postprocessor files have the same format. They consist of a sequence of lines, each of which is one of four types:

  1. Symbol definitions
  2. Context-sensitive rewrite rules
  3. Comments
  4. Blank lines

Symbol definitions

Lines like the following

::vowels:: = a|e|i|o|u

define symbols that can be reused in writing rules. Symbols must consist of a prefix of two colons, a sequence of one or more lowercase letters and underscores, and a suffix of two colons. They are separated from their definitions by an equals sign (optionally set off with white space). The definition consists of a substring from a regular expression.

Symbols must be defined before they are referenced.

Rewrite rules

Context-sensitive rewrite rules in Epitran are written in a format familiar to phonologists but transparent to computer scientists. They can be schematized as

a -> b / X _ Y

which can be rewritten as

XaY → XbY

The arrow -> can be read as "is rewritten as" and the slash / can be read as "in the context". The underscore indicates the position of the symbol(s) being rewritten. Another special symbol is the octothorp #, which indicates the beginning or end of a (word length) string (a word boundary). Consider the following rule:

e -> ə / _ #

This rule can be read as "/e/ is rewritten as /ə/ at the end of a word." A final special symbol is zero 0, which represents the empty string. It is used in rules that insert or delete segments. Consider the following rule, which deletes /ə/ between /k/ and /l/:

ə -> 0 / k _ l

All rules must include the arrow operator, the slash operator, and the underscore. A rule that applies in a context-free fashion can be written in the following way:

ch -> x / _

The implementation of context-sensitive rules in Epitran pre- and post-processors uses regular expression replacement. Specifically, it employs the regex package, a drop-in replacement for re. Because of this, regular expression notation can be used in writing rules:

c -> s / _ [ie]

or

c -> s / _ (i|e)

For a complete guide to regex regular expressions, see the documentation for re and for regex, specifically.

Fragments of regular expressions can be assigned to symbols and reused throughout a file. For example, a symbol for the disjunction of vowels in a language can be used in a rule that changes /u/ into /w/ before vowels:

::vowels:: = a|e|i|o|u
...
u -> w / _ (::vowels::)

There is a special construct for handling cases of metathesis (where "AB" is replaced with "BA"). For example, the rule:

(?P<sw1>[เแโไใไ])(?P<sw2>.) -> 0 / _

Will "swap" the positions of any character in "เแโไใไ" and any following character. Left of the arrow, there should be two groups (surrounded by parentheses) with the names sw1 and sw2 (a name for a group is specified by ?P<name> appearing immediately after the open parenthesis for a group). The substrings matched by the two groups, sw1 and sw2 will be "swapped" or metathesized. The item immediately right of the arrow is ignored, but the context is not.

To move IPA tones to the end of the word, first ensure that tones=True in the instantiated Epitran object and use the following rule:

(?P<sw1>[˩˨˧˦˥]+)(?P<sw2>\w+) -> 0 / _\b

The rules apply in order, so earlier rules may "feed" and "bleed" later rules. Their sequence is therefore very important and can be exploited to achieve the desired output.
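A hypothetical illustration of feeding (the rules are invented for the example): the first rule creates /z/ tokens that the second rule then consumes, so reversing their order would change the output:

::vowels:: = a|e|i|o|u
s -> z / (::vowels::) _ (::vowels::)
z -> ʒ / _ i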

Comments and blank lines

Comments and blank lines (lines consisting only of white space) are allowed to make your code more readable. Any line in which the first non-whitespace character is a percent sign % is interpreted as a comment; the rest of the line is ignored when the file is interpreted. Blank lines are also ignored.
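For example (the rule itself is hypothetical):

% palatalize /k/ before front vowels

k -> t͡ʃ / _ (i|e)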

A strategy for adding language support

Epitran uses a mapping-and-repairs approach to G2P. It is expected that there is a mapping between graphemes and phonemes that can do most of the work of converting orthographic representations to phonological representations. In phonemically adequate orthographies, this mapping can do all of the work. This mapping should be completed first. For many languages, a basis for this mapping table already exists on Wikipedia and Omniglot (though the Omniglot tables are typically not machine readable).

On the other hand, many writing systems deviate from the phonemically adequate ideal. It is here that pre- and post-processors must be introduced. For example, in Swedish, the letter <a> receives a different pronunciation before two consonants (/ɐ/) than elsewhere (/ɑː/). It makes sense to add a preprocessor rule that rewrites <a> as /ɐ/ before two consonants (and similar rules for the other vowels, since they are affected by the same condition). Preprocessor rules should generally be employed whenever the orthographic representation must be adjusted (by contextual changes, deletions, etc.) prior to the mapping step.
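A hedged sketch of such a rule in the format described above (the consonant inventory is abbreviated for illustration):

::cons:: = p|t|k|b|d|g|f|s|v|m|n|r|l
a -> ɐ / _ (::cons::)(::cons::)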

One common use for postprocessors is to eliminate characters that are needed by the preprocessors or maps but which should not appear in the output. A classic example of this is the virama used in Indic scripts. In these scripts, in order to write a consonant not followed by a vowel, one uses the form of the consonant symbol with a particular inherent vowel followed by a virama (which has various names in different Indic languages). An easy way of handling this is to allow the mapping to translate the consonant into an IPA consonant plus the inherent vowel (which, for a given language, will always be the same), then use the postprocessor to delete the vowel + virama sequence (wherever it occurs).
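For instance, using the Devanagari virama ् purely for illustration, if the map emits each consonant together with its inherent vowel /ə/ and lets the virama pass through to the output, a postprocessor rule of roughly this shape deletes the pair:

ə् -> 0 / _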

In fact, any situation where a character introduced by the map needs to be subsequently deleted is a good use case for postprocessors. Another example from Indic languages is so-called schwa deletion. Some vowels implied by a direct mapping between the orthography and the phonology are not actually pronounced; these vowels can generally be predicted. In most languages, they occur after a vowel + consonant sequence and before a consonant + vowel sequence. In other words, the rule looks like the following:

ə -> 0 / (::vowel::)(::consonant::) _ (::consonant::)(::vowel::)

Perhaps the best way to learn how to structure language support for a new language is to consult the existing languages in Epitran. The French preprocessor fra-Latn.txt and the Thai postprocessor tha-Thai.txt illustrate many of the use-cases for these rules.

Citing Epitran

If you use Epitran in published work, or in other research, please use the following citation:

David R. Mortensen, Siddharth Dalmia, and Patrick Littell. 2018. Epitran: Precision G2P for many languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).

@InProceedings{Mortensen-et-al:2018,
  author = {Mortensen, David R.  and Dalmia, Siddharth and Littell, Patrick},
  title = {Epitran: Precision {G2P} for Many Languages},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May},
  date = {7--12},
  location = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and H\'el\`ene Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {979-10-95546-00-9},
  language = {english}
  }


epitran's Issues

Incorrect grapheme-phoneme alignment in word_to_tuple response

Thank you for this great tool!
I was hoping to use Epitran to extract frequencies of grapheme-phoneme alignment in different languages. But I am running into issues when using the word_to_tuples and word_to_segs features.

Here is the output of epi.word_to_tuples for the word tough in English

('L', 0, 't', 't', [('t', <map object at 0x113817c50>)])
('L', 0, 'o', 'ʌ', [('ʌ', <map object at 0x113817250>)])
('L', 0, 'u', 'f', [('f', <map object at 0x1120a06d0>)])
('L', 0, 'g', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('L', 0, 'h', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])

Here is the output for choice

('L', 0, 'c', 't͡ʃ', [('t͡ʃ', <map object at 0x11380cad0>)])
('L', 0, 'h', 'o', [('o', <map object at 0x11380c5d0>)])
('L', 0, 'o', 'j', [('j', <map object at 0x11380cb10>)])
('L', 0, 'i', 's', [('s', <map object at 0x1120a0fd0>)])
('L', 0, 'c', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])
('L', 0, 'e', '', [(-1, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])])

I'd expect the phonetic form /f/ in tough to correspond to either g or h, and the phonetic form /s/ in choice to correspond to c. However, that's not the case. I am wondering: is this expected behavior or a bug?

Bengali script sometimes leaves Bengali characters in transcriptions

IPA transliterations of Bengali characters with Chandrabindus in them leave the Chandrabindu there, when it should be replaced with a combining tilde, the corresponding IPA character. With epitran 0.56 installed:

>>> import epitran
>>> translator = epitran.Epitran('ben-Beng')
>>> translator.transliterate('হাঁ')
ɦaঁ

I haven't checked extensively, but it is possible this also occurs with other languages and diacritics.

question - Method for Arpabet conversion?

First of all, thanks for making this great software. It works perfectly for me.
Also, adding rules is explained very clearly, and I could implement it with ease.

I am parsing, then converting, a Dutch wordlist to IPA and X-SAMPA, trying to generate a dict for building voices. I saw there's an Arpabet mapping too, which would be handy for training Sphinx. Should I create a class and an ipa2arpa.csv like you did for the X-SAMPA conversion?

I am now using xsampa like this:

import epitran
from epitran.xsampa import XSampa

# set to Dutch
epi = epitran.Epitran('nld-Latn')

# X-SAMPA converter
xs = XSampa()

s = epi.transliterate(word).encode("utf-8")
s_a = xs.ipa2xs(unicode(s, "utf-8"))
So I could also make a class like XSampa for ipa2arpa, or is there a simpler way?

Problem with transliterating English contractions

How does Epitran transliterate contractions? It seems that the package has difficulties with them. For example:

  • everyone's is transliterated as /ɛvɹiownz/ instead of /ɛvɹiwonz/ (o and w are wrongly reversed)
  • aren't is transliterated as /ɑɹənt/ instead of /ɑɹnt/. (a vowel is wrongly inserted)

Simply concatenating the contractions seems to give better results in some cases. Why is that?

KeyError when trying to transcribe any text

Traceback (most recent call last):
  File "d:/tcritp/tcript.py", line 3, in <module>
    tr.transliterate(u'spark')
  File "D:\Python3\lib\site-packages\epitran\_epitran.py", line 62, in transliterate
    return self.epi.transliterate(word, normpunc, ligatures)
  File "D:\Python3\lib\site-packages\epitran\flite.py", line 92, in transliterate
    acc.append(self.english_g2p(chunk))
  File "D:\Python3\lib\site-packages\epitran\flite.py", line 211, in english_g2p
    return self.arpa_to_ipa(arpa_text)
  File "D:\Python3\lib\site-packages\epitran\flite.py", line 76, in arpa_to_ipa
    text = ''.join(ipa_list)
  File "D:\Python3\lib\site-packages\epitran\flite.py", line 75, in <lambda>
    ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
KeyError: ''

Source:

from epitran import Epitran
tr = Epitran('eng-Latn', cedict_file='cedict_1_0_ts_utf-8_mdbg.txt')
tr.transliterate('test')

Note: Changing 'test' to u'test' does not help.

ERROR! Related to MICROSOFT VISUAL STUDIO C++

Hello, can someone help me with this error? I've already updated my Microsoft Visual Studio because that was the first error; now I am getting this. Why is this happening? Thank you!

C:\Users\LENOVO>pip install epitran
Collecting epitran
  Using cached epitran-1.8-py2.py3-none-any.whl (132 kB)
Requirement already satisfied: regex in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from epitran) (2020.7.14)
Requirement already satisfied: unicodecsv in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from epitran) (0.14.1)
Requirement already satisfied: setuptools in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from epitran) (50.3.1)
Collecting marisa-trie
  Using cached marisa-trie-0.7.5.tar.gz (270 kB)
Collecting panphon>=0.16
  Using cached panphon-0.17-py2.py3-none-any.whl (71 kB)
Requirement already satisfied: PyYAML in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from panphon>=0.16->epitran) (5.3)
Collecting editdistance
  Using cached editdistance-0.5.3.tar.gz (27 kB)
Requirement already satisfied: numpy in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from panphon>=0.16->epitran) (1.18.0)
Requirement already satisfied: munkres in c:\users\lenovo\appdata\local\programs\python\python38\lib\site-packages (from panphon>=0.16->epitran) (1.1.4)
Building wheels for collected packages: marisa-trie, editdistance
  Building wheel for marisa-trie (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\LENOVO\AppData\Local\Temp\pip-wheel-51uiu7_u'
       cwd: C:\Users\LENOVO\AppData\Local\Temp\pip-install-n07jydhk\marisa-trie\
  Complete output (23 lines):
  running bdist_wheel
  running build
  running build_clib
  building 'libmarisa-trie' library
  creating build
  creating build\temp.win-amd64-3.8
  creating build\temp.win-amd64-3.8\marisa-trie
  creating build\temp.win-amd64-3.8\marisa-trie\lib
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\io
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\trie
  creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\vector
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\agent.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\agent.obj
  agent.cc
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\keyset.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\keyset.obj
  keyset.cc
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\trie.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\trie.obj
  trie.cc
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa/grimoire/io\mapper.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa/grimoire/io\mapper.obj
  mapper.cc
  marisa-trie\lib\marisa/grimoire/io\mapper.cc(4): fatal error C1083: Cannot open include file: 'windows.h': No such file or directory
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
  ----------------------------------------
  ERROR: Failed building wheel for marisa-trie
  Running setup.py clean for marisa-trie
  Building wheel for editdistance (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\editdistance\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\editdistance\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\LENOVO\AppData\Local\Temp\pip-wheel-cibnnpm3'
       cwd: C:\Users\LENOVO\AppData\Local\Temp\pip-install-n07jydhk\editdistance\
  Complete output (30 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.8
  creating build\lib.win-amd64-3.8\editdistance
  copying editdistance\__init__.py -> build\lib.win-amd64-3.8\editdistance
  copying editdistance\_editdistance.h -> build\lib.win-amd64-3.8\editdistance
  copying editdistance\def.h -> build\lib.win-amd64-3.8\editdistance
  running build_ext
  building 'editdistance.bycython' extension
  creating build\temp.win-amd64-3.8
  creating build\temp.win-amd64-3.8\Release
  creating build\temp.win-amd64-3.8\Release\editdistance
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -I./editdistance -Ic:\users\lenovo\appdata\local\programs\python\python38\include -Ic:\users\lenovo\appdata\local\programs\python\python38\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpeditdistance/_editdistance.cpp /Fobuild\temp.win-amd64-3.8\Release\editdistance/_editdistance.obj
  _editdistance.cpp
  editdistance/_editdistance.cpp(91): warning C4267: 'initializing': conversion from 'size_t' to 'unsigned int', possible loss of data
  editdistance/_editdistance.cpp(119): note: see reference to function template instantiation 'unsigned int edit_distance_map_<1>(const int64_t *,const size_t,const int64_t *,const size_t)' being compiled
  editdistance/_editdistance.cpp(92): warning C4267: 'initializing': conversion from 'size_t' to 'unsigned int', possible loss of data
  editdistance/_editdistance.cpp(44): warning C4018: '<=': signed/unsigned mismatch
  editdistance/_editdistance.cpp(97): note: see reference to function template instantiation 'unsigned int edit_distance_bpv<cmap_v,varr<1>>(T &,const int64_t *,const size_t &,const unsigned int &,const unsigned int &)' being compiled
          with
          [
              T=cmap_v
          ]
  editdistance/_editdistance.cpp(119): note: see reference to function template instantiation 'unsigned int edit_distance_map_<1>(const int64_t *,const size_t,const int64_t *,const size_t)' being compiled
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -I./editdistance -Ic:\users\lenovo\appdata\local\programs\python\python38\include -Ic:\users\lenovo\appdata\local\programs\python\python38\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpeditdistance/bycython.cpp /Fobuild\temp.win-amd64-3.8\Release\editdistance/bycython.obj
  bycython.cpp
  c:\users\lenovo\appdata\local\programs\python\python38\include\pyconfig.h(206): fatal error C1083: Cannot open include file: 'basetsd.h': No such file or directory
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
  ----------------------------------------
  ERROR: Failed building wheel for editdistance
  Running setup.py clean for editdistance
Failed to build marisa-trie editdistance
Installing collected packages: marisa-trie, editdistance, panphon, epitran
    Running setup.py install for marisa-trie ... error
    ERROR: Command errored out with exit status 1:
     command: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\LENOVO\AppData\Local\Temp\pip-record-teibnt8o\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\lenovo\appdata\local\programs\python\python38\Include\marisa-trie'
         cwd: C:\Users\LENOVO\AppData\Local\Temp\pip-install-n07jydhk\marisa-trie\
    Complete output (23 lines):
    running install
    running build
    running build_clib
    building 'libmarisa-trie' library
    creating build
    creating build\temp.win-amd64-3.8
    creating build\temp.win-amd64-3.8\marisa-trie
    creating build\temp.win-amd64-3.8\marisa-trie\lib
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\io
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\trie
    creating build\temp.win-amd64-3.8\marisa-trie\lib\marisa\grimoire\vector
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\agent.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\agent.obj
    agent.cc
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\keyset.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\keyset.obj
    keyset.cc
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa\trie.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa\trie.obj
    trie.cc
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Imarisa-trie\lib -Imarisa-trie\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.27.29110\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" /EHsc /Tpmarisa-trie\lib\marisa/grimoire/io\mapper.cc /Fobuild\temp.win-amd64-3.8\marisa-trie\lib\marisa/grimoire/io\mapper.obj
    mapper.cc
    marisa-trie\lib\marisa/grimoire/io\mapper.cc(4): fatal error C1083: Cannot open include file: 'windows.h': No such file or directory
    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\lenovo\appdata\local\programs\python\python38\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\LENOVO\\AppData\\Local\\Temp\\pip-install-n07jydhk\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\LENOVO\AppData\Local\Temp\pip-record-teibnt8o\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\lenovo\appdata\local\programs\python\python38\Include\marisa-trie' Check the logs for full command output.

C:\Users\LENOVO>

No rule to make target 'lex_lookup'

Hello

I am trying to install lex_lookup, since I wish to convert an English text to IPA.
I ran Cygwin on Windows 10 and followed the instructions, including changing "cp -pd" to "cp -pR" in the relevant flite-2.0.5-current\main\Makefile. However, I cannot manage to run the "make lex_lookup" command.

Thank you very much for your help.

Usage of BCP 47 tags with ASR systems

I am curious to get a sense of what other researchers feel about the use of BCP 47 tags for speech recognition models, and at what level (data IDs, training data, the model itself, or output from the model). Read more about BCP 47 here: https://www.w3.org/International/articles/language-tags/
https://tools.ietf.org/html/bcp47

Some months ago I was on the IETF mailing list for sub-tags and suggested that speech-to-text and text-to-speech models should have tags identifying them. But there didn't seem to be any great "aha!"s from that crowd.

Citing Epitran

Hi!

I used this library for some work that I am writing a paper on. Is there something that I can cite? I should note that I also used Panphon and cited appropriately from the paper linked in that README.

How to get offset mapping?

For example I have epi.transliterate('янъ') -> jan
With word_to_tuples I can find skipped 'ъ', but how can I know what 'я' occupy the first two indexes?

EpihanTraditional Class does not have regexp attribute

When running Backoff class with "cmn-Hant" (which uses EpihanTraditional), it complains with the following error:

File "/usr2/home/amuis/anaconda3/envs/py36/lib/python3.6/site-packages/epitran/backoff.py", line 46, in transliterate
    m = lang.epi.regexp.match(dia.process(token))
AttributeError: 'EpihanTraditional' object has no attribute 'regexp'

This can be easily fixed by adding the following line at the end of https://github.com/dmort27/epitran/blob/master/epitran/epihan.py

self.regexp = re.compile(r'\p{Han}')

I assume the character class \p{Han} captures both traditional and simplified Chinese.

In French, the final 's' should be silent, 'es' shouldn't after a consonant

Hello, I just discovered this awesome module, and I found two issues with the French language (both with fra-Latn and fra-Latn-np).

Final 's'

When a word ends with an 's', the 's' is silent. So "il" ("he / she") is pronounced in the same way as "ils" ("they"). However, when I try epi.transliterate("il") and epi.transliterate("ils"), it returns il and ils.

Final 'es'

The final 'es' is pronounced when it comes after a consonant. For example, "faites" ("do") is pronounced "fɛt" and "fait" ("done") is pronounced "fɛ". But transliterate() returns "fe" and "fe".

In the same way, it returns "ɡaraʒ" for "garage" (which is correct, Wikitionary gives "ɡa.ʁaʒ") but "ɡara" for "garages".

Unable to run word_to_tuples for English

I am facing an issue running the model for English. I have installed Flite and am able to run c = os.system(command) from my Python script as well.
I get the following warning:

WARNING:root:lex_lookup (from flite) is not installed.

Did anyone else face this issue? Could you let me know how you have solved it? Thanks!

help please: couldn't get the amh-Ethi working

Here is the traceback for Epitran("amh-Ethi"); for other languages it works fine.

import epitran
epi = epitran.Epitran("amh-Ethi")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "epitran/_epitran.py", line 42, in __init__
    self.epi = SimpleEpitran(code, preproc, postproc, ligatures)
  File "epitran/simple.py", line 52, in __init__
    self.postprocessor = PrePostProcessor(code, 'post')
  File "epitran/ppprocessor.py", line 28, in __init__
    self.rules = self._read_rules(code, fix)
  File "epitran/ppprocessor.py", line 38, in _read_rules
    return Rules([abs_fn])
  File "epitran/rules.py", line 28, in __init__
    rules = self._read_rule_file(rule_file)
  File "epitran/rules.py", line 36, in _read_rule_file
    rules.append(self._read_rule(line))
  File "epitran/rules.py", line 65, in _read_rule
    return self._fields_to_function(a, b, X, Y)
  File "epitran/rules.py", line 81, in _fields_to_function
    regexp = re.compile(left)
  File "/home/anaconda2/lib/python2.7/site-packages/regex.py", line 345, in compile
    return _compile(pattern, flags, kwargs)
  File "/home/anaconda2/lib/python2.7/site-packages/regex.py", line 490, in _compile
    caught_exception.pos)
_regex_core.error: missing ) at position 53

The same word is transliterated differently

I got a strange transliteration for Italian:

abiud d͡ʒenerɔ eliat͡ʃim eliat͡ʃim ɡenerɔ asor

ɡenerɔ should be d͡ʒenerɔ. This happens when the string is part of a much larger string, but not when it is transliterated in isolation (i.e., by itself).

backoff hanging at non-linguistic input

Just testing this out for fun... one thing that I notice is that the backoff feature seems to hang if it gets some input that isn't in its alphabets, e.g.

epi = epitran.Epitran('tur-Latn')
epi.transliterate('merhaba: nasilsin?!')
'meɾhaba: nasilsin?!'

works pretty quickly (and would work even better if I had used an actual Turkish keyboard)

Whereas

backoff = Backoff(['tur-Latn', 'hin-Deva'])
backoff.transliterate('merhaba: nasilsin?!')

seems to hang indefinitely...

Maybe backoff needs an extra something for pass-throughs? :)

Does context matter?

Should I feed individual words to Epitran, or a whole sentence?
Are there any rules that use context to transliterate the words?

attempt to transliterate english not working, despite flite being installed properly

I've been trying to use the English transliteration, without success.

I did follow the installation instructions for flite (and also copied the relevant binaries into /usr/local/bin), and the process seems to have worked, since I no longer get the "lex_lookup not installed" kind of error.

However, I'm still stuck on a rather cryptic KeyError. When I do (in Python 3):

import epitran
epi = epitran.Epitran('eng-Latn')
epi.transliterate('Berkeley')

this is what I get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-e30894fd177f> in <module>
----> 1 epi.transliterate('Berkeley')

~/.local/lib/python3.7/site-packages/epitran/_epitran.py in transliterate(self, word, normpunc, ligatures)
     60             unicode: IPA string
     61         """
---> 62         return self.epi.transliterate(word, normpunc, ligatures)
     63 
     64     def reverse_transliterate(self, ipa):

~/.local/lib/python3.7/site-packages/epitran/flite.py in transliterate(self, text, normpunc, ligatures)
     89         for chunk in self.chunk_re.findall(text):
     90             if self.letter_re.match(chunk):
---> 91                 acc.append(self.english_g2p(chunk))
     92             else:
     93                 acc.append(chunk)

~/.local/lib/python3.7/site-packages/epitran/flite.py in english_g2p(self, text)
    205             logging.warning('Non-zero exit status from lex_lookup.')
    206             arpa_text = ''
--> 207         return self.arpa_to_ipa(arpa_text)

~/.local/lib/python3.7/site-packages/epitran/flite.py in arpa_to_ipa(self, arpa_text, ligatures)
     73         arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
     74         ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
---> 75         text = ''.join(ipa_list)
     76         return text
     77 

~/.local/lib/python3.7/site-packages/epitran/flite.py in <lambda>(d)
     72         arpa_list = self.arpa_text_to_list(arpa_text)
     73         arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
---> 74         ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
     75         text = ''.join(ipa_list)
     76         return text

KeyError: 'iy)\n(b'

No matter what my query is, it seems self.arpa_map does not contain it. What am I doing wrong?

lex_lookup (from flite) is not installed

How to solve this problem?

import epitran

epi = epitran.Epitran('eng-Latn')

epi.transliterate('Hello')
WARNING:root:lex_lookup (from flite) is not installed.
Traceback (most recent call last):

  File "<ipython-input-3-9e6f98d7c4c9>", line 1, in <module>
    epi.transliterate('Hello')

  File "C:\ProgramData\Anaconda3\lib\site-packages\epitran\_epitran.py", line 62, in transliterate
    return self.epi.transliterate(word, normpunc, ligatures)

  File "C:\ProgramData\Anaconda3\lib\site-packages\epitran\flite.py", line 94, in transliterate
    acc.append(self.english_g2p(chunk))

  File "C:\ProgramData\Anaconda3\lib\site-packages\epitran\flite.py", line 212, in english_g2p
    arpa_text = arpa_text.splitlines()[0]

IndexError: list index out of range

It's working for other languages but not English.

Incorrect transliteration of Yoruba /j/

It seems that Yoruba, which uses ⟨y⟩ for the approximant /j/, is being incorrectly transcribed such that ⟨y⟩ becomes the vowel /y/.

>>> import epitran
>>> translator = epitran.Epitran('yor-Latn')
>>> translator.transliterate('Yorùbá') # expected output: 'jōrùbá'
'yorùbá'
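Until the mapping is corrected, a hedged stopgap is possible because Yoruba has no front rounded vowel /y/, so any 'y' in the output can only come from orthographic ⟨y⟩. This patches the output; it is not a fix in Epitran itself:

import epitran

epi = epitran.Epitran('yor-Latn')

# Rewrite the stray vowel symbol /y/ to the approximant /j/.
ipa = epi.transliterate('Yorùbá').replace('y', 'j')
print(ipa)  # 'jorùbá'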

KeyError

When will this problem be fixed? Thank you very much!

errors for Italian

epi = epitran.Epitran("ita-Latn") 
epi.transliterate("motorizzazione") 

returns 'motorit͡sasione', but it should be at least 'motorit͡sat͡sione' or, better, 'motorit͡st͡sat͡st͡sione' (the semivowel "i" should be "j", but I do not know how fine-grained the transliteration is supposed to be)

Duplicated entries in ipa-xsampa.csv

I downloaded ipa-xsampa.csv and found some errors in the data, e.g.

  • R\ in X-SAMPA maps to both the voiced uvular fricative and the voiceless uvular trill
  • glottal plosive has two identical rows

I corrected these based on Wikipedia. You may like to check the modified file: ipa-xsampa-modified.csv.txt. Note that I modified the file according to my own requirements, so it might not suit your needs.
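A hedged sketch for spotting both kinds of problem automatically; the column layout of ipa-xsampa.csv is an assumption, so adjust the indices to the actual header:

import csv
from collections import Counter, defaultdict

with open('ipa-xsampa.csv', encoding='utf-8') as f:
    rows = list(csv.reader(f))
header, body = rows[0], rows[1:]

# Identical rows (e.g. the duplicated glottal plosive)
dupes = [row for row, n in Counter(map(tuple, body)).items() if n > 1]
print('identical rows:', dupes)

# X-SAMPA symbols mapped to more than one distinct row (e.g. R\)
by_xsampa = defaultdict(set)
for row in body:
    by_xsampa[row[0]].add(tuple(row))  # assumed: X-SAMPA symbol in column 0
print('ambiguous:', {k: v for k, v in by_xsampa.items() if len(v) > 1})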

Thanks for the data!

IndexError: list index out of range

When I am running this code:

import epitran
epi = epitran.Epitran('eng-Latn')
print (epi.transliterate(u'Berkeley'))

I am using Python 3. Would you please help me fix this error?

File "/home/hamada/.local/lib/python3.6/site-packages/epitran/flite.py", line 212, in english_g2p
arpa_text = arpa_text.splitlines()[0]
IndexError: list index out of range

flite and t2p are provided in Ubuntu packages

You can install flite with sudo apt install flite; t2p is included.

https://packages.ubuntu.com/search?keywords=flite&searchon=names&exact=1&suite=all&section=all

$ dpkg -S /usr/bin/t2p
flite: /usr/bin/t2p
$ flite --version
  Carnegie Mellon University, Copyright (c) 1999-2016, all rights reserved
  version: flite-2.1-release Dec 2017 (http://cmuflite.org)
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

lex_lookup must still be built from source.

When I use Epitran with eng-Latn, it reports that lex_lookup (from flite) is not installed, but I have installed lex_lookup


(base) [root@host-10-29-0-161 testsuite]# make lex_lookup
Makefile:83: warning: overriding recipe for target `multi_thread'
Makefile:80: warning: ignoring old recipe for target `multi_thread'
make: `lex_lookup' is up to date.

Error:

import epitran
epi = epitran.Epitran('eng-Latn')
print(epi.transliterate(u'Berkeley'))
WARNING:root:lex_lookup (from flite) is not installed.
Traceback (most recent call last):
File "", line 1, in
File "/opt/huawei/data1/z00574176/G2P/git_reproduced/epitran-master/epitran/_epitran.py", line 62, in transliterate
return self.epi.transliterate(word, normpunc, ligatures)
File "/opt/huawei/data1/z00574176/G2P/git_reproduced/epitran-master/epitran/flite.py", line 96, in transliterate
acc.append(self.english_g2p(chunk))
File "/opt/huawei/data1/z00574176/G2P/git_reproduced/epitran-master/epitran/flite.py", line 214, in english_g2p
arpa_text = arpa_text.splitlines()[0]
IndexError: list index out of range
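When the warning appears even though lex_lookup has been built, one hedged first check is whether the Python process itself can see the binary (conda environments and IDEs sometimes run with a different PATH than your login shell):

import shutil

# None here means Epitran's subprocess calls cannot find it either; make
# sure the directory containing lex_lookup is on PATH for this process.
print(shutil.which('lex_lookup'))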

KeyError on any English transliteration

I wanted to give this a whirl but hit a speed bump right from the get-go:

In [2]: epi = epitran.Epitran('eng-Latn')
In [3]: epi.transliterate('iceland')
WARNING:root:lex_lookup (from flite) is not installed.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-54aaf7e8072d> in <module>()
----> 1 epi.transliterate('iceland')

/home/jeremy/.local/lib/python3.6/site-packages/epitran/_epitran.py in transliterate(self, word, normpunc, ligatures)
     60             unicode: IPA string
     61         """
---> 62         return self.epi.transliterate(word, normpunc, ligatures)
     63 
     64     def reverse_transliterate(self, ipa):

/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in transliterate(self, text, normpunc, ligatures)
     89         for chunk in self.chunk_re.findall(text):
     90             if self.letter_re.match(chunk):
---> 91                 acc.append(self.english_g2p(chunk))
     92             else:
     93                 acc.append(chunk)

/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in english_g2p(self, text)
    205             logging.warning('Non-zero exit status from lex_lookup.')
    206             arpa_text = ''
--> 207         return self.arpa_to_ipa(arpa_text)

/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in arpa_to_ipa(self, arpa_text, ligatures)
     73         arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
     74         ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
---> 75         text = ''.join(ipa_list)
     76         return text
     77 

/home/jeremy/.local/lib/python3.6/site-packages/epitran/flite.py in <lambda>(d)
     72         arpa_list = self.arpa_text_to_list(arpa_text)
     73         arpa_list = map(lambda d: re.sub('\d', '', d), arpa_list)
---> 74         ipa_list = map(lambda d: self.arpa_map[d], arpa_list)
     75         text = ''.join(ipa_list)
     76         return text

KeyError: ''

In [4]: epi2 = epitran.Epitran('rus-Cyrl')

In [7]: epi.transliterate('')
Out[7]: ''

In [9]: epi2.transliterate('Приве́т')
Out[9]: 'prʲivʲét'

German transliteration issues

Hello,

I came across what I believe to be a bug in the German transliteration of the grapheme 's'. This occurs when using the 'deu-Latn' and the 'deu-Latn-nar' dictionaries. Take, for example, the word 'sehr':

In [14]: epi1.transliterate('sehr')
Out[14]: 't͡seːə'
In [16]: epi3.transliterate('sehr')
Out[16]: 't͡seːɐ'

Here epi1 was initialized with the 'deu-Latn' dictionary and epi3 with the 'deu-Latn-nar' dictionary.

In both cases I would expect the 's' in 'sehr' to be transliterated with [z]. I know that [s] is also possible in this case when dealing with southern German dialects, and I see this transliteration when using the 'deu-Latn-np' dictionary. However, after consulting all my sources, I don't see a case where this can be transliterated as [t͡s].

Another example would be the word 'Stock':

In [20]: epi1.transliterate('Stock')
Out[20]: 'stok'
In [21]: epi3.transliterate('Stock')
Out[21]: 'stok'

In the case of the 'deu-Latn' example, I can understand why this may be transliterated as [s], but at least with the narrow transliteration I would expect [ʃ]. As far as I know, [s] only occurs in this environment in northern German dialects.

Would you mind investigating this with me? What I've done so far is look at tens of examples (I'm transliterating a large corpus), and it seems to happen across the board, no exceptions. I also made sure that I pip installed the latest version of Epitran.

Regex + rules difference between [] and ().

In the fra-Latn.txt preprocessor there are some matches that use [] and others that use ():

::vowel:: = a|á|â|æ|e|é|è|ê|ë|i|î|ï|o|ô|œ|u|ù|û|ü|A|Á|Â|Æ|E|É|È|Ê|Ë|I|Î|Ï|O|Ô|Œ|U|Ù|Û|Ü|ɛ
::front_vowel:: = e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ
::consonant:: = b|ç|c|ch|d|f|g|j|k|l|m|n|p|r|s|t|v|w|z|ʒ
% Treatment of <c> and <s>
sc -> s / _ [::front_vowel::]
c -> s / _ [::front_vowel::]

% High vowels become glides before vowels
ou -> w / _ (::vowel::)
u -> ɥ / _ (::vowel::)

Is there a difference in behaviour between the two?

Am I right in thinking that:

  • [::front_vowel::] is [e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ] in regex.
  • (::front_vowel::) is (e|é|è|ê|ë|i|î|ï|y|E|É|È|Ê|Ë|I|Î|Ï|Y|ɛ) in regex.

From what I understand of regex they look as though they'd do the same thing, except [::front_vowel::] would also match the | char.

I also don't think [] would work if there are two or more chars in a group, for example ch in:

::consonant:: = b|ç|c|ch|d|f|g|j|k|l|m|n|p|r|s|t|v|w|z|ʒ

I'd also guess that () creates capturing groups, but I'm not sure whether that's being utilised.
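Your reading matches how Python's regex engine treats the two. A quick check with toy patterns (these are illustrative, not Epitran's actual rule compiler):

import re

cls = re.compile(r'[a|ch]')   # character class: '|', 'c', 'h' are all literals
grp = re.compile(r'(a|ch)')   # group: alternation between 'a' and 'ch'

print(cls.findall('a|ch'))    # ['a', '|', 'c', 'h']  -- '|' matched, 'ch' split up
print(grp.findall('a|ch'))    # ['a', 'ch']           -- multi-char alternative intact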

Any guidance would be greatly appreciated.

Different IPA with punctuation in German

Hello all,

First off I wanted to say well done on Epitran! It is a tool that has proven useful for many projects of mine.

I stumbled across something today and I wanted to know whether Epitran was designed to do this, or whether it's a bug. I noticed that when I transliterate words in German, they get different IPA transliterations when punctuation is added. As far as I can tell, this doesn't happen in any other language (I tried to reproduce the error in Polish, Russian, and English).

Examples:

  • transliterate('heute') -> 'hoytə' but transliterate('heute?') -> 'hoyhte?'
  • transliterate('Tag') -> 'tak' but transliterate('Tag!') -> 'taɡ!'
  • transliterate('Ende') -> 'əndə' but transliterate('Ende.') -> 'ənde.'

For the last two examples, I could accept the transliterations that are produced when punctuation is added to the string, when phonetic environment and dialect are taken into consideration. However, to my knowledge, I don't know of any case where 'heute' should have an 'h' after the diphthong in its transliteration.

When using the transliterate function, I normally use normpunc=True and ligatures=True, but even disabling those flags produces the same results. I also used pip to check that I was using the latest version of Epitran.
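For reference, a minimal reproduction sketch with those flags (deu-Latn assumed, per the examples above):

import epitran

epi = epitran.Epitran('deu-Latn')
for word in ['heute', 'heute?', 'Tag', 'Tag!', 'Ende', 'Ende.']:
    print(word, '->', epi.transliterate(word, normpunc=True, ligatures=True))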

I would really appreciate some info on this matter, as it will guide my future projects. Thanks a lot for your time!

epitran.exceptions.DatafileError: Header is ["Prth", "Phon"] instead of ["Orth", "Phon"].

I'm trying to use Epitran to obtain the correct phonetic pronunciations of French words. I did eventually get it working through the fra-Latn preprocessor; however, its performance is lackluster. It seems to give me very literal transliterations, ones that never use the uvular "ʁ" sound or the "." syllable separator:

  • "acteur" ("actor") comes out as "atyr" (should be "ak.tœʁ")
  • "actrice" ("actress") comes out as "aktriz" (should be "ak.tʁis")
  • "chat" ("cat") comes out as "ʃa", which is correct, but at least one time when I tried it I got trailing symbols, like "ʃat"
  • "chien" ("dog") comes out as "ʃjâ" when it should be "ʃjɛ̃"

After the mixed performance with that, I looked at the documentation and noticed there was a more phonetic transliterator, "fra-Latn-np". Upon attempting to use it to transliterate any word at all, I get the following error:

Traceback (most recent call last):
  File "main.py", line 6, in <module>
    epi = epitran.Epitran('fra-Latn-np')
  File "/home/callum/.local/share/virtualenvs/first625-xxpZk1TH/lib/python3.8/site-packages/epitran/_epitran.py", line 46, in __init__
    self.epi = SimpleEpitran(code, preproc, postproc, ligatures, rev, rev_preproc, rev_postproc, tones=tones)
  File "/home/callum/.local/share/virtualenvs/first625-xxpZk1TH/lib/python3.8/site-packages/epitran/simple.py", line 43, in __init__
    self.g2p = self._load_g2p_map(code, False)
  File "/home/callum/.local/share/virtualenvs/first625-xxpZk1TH/lib/python3.8/site-packages/epitran/simple.py", line 100, in _load_g2p_map
    raise DatafileError('Header is ["{}", "{}"] instead of ["Orth", "Phon"].'.format(orth, phon))
epitran.exceptions.DatafileError: Header is ["Prth", "Phon"] instead of ["Orth", "Phon"].

I'm not sure what causes it, but looking in that directory there is also an undocumented "fra-Latn-p" preprocessor, which sometimes does better and sometimes worse. Could you please explain what is going on here?
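The error message suggests a one-character typo ("Prth" for "Orth") in the header of the installed mapping file. A hedged one-off patch sketch; the data path below is an assumption about how Epitran lays out its package data, so verify it before writing:

import os
import epitran

# Assumed layout: epitran/data/map/<code>.csv inside the installed package.
path = os.path.join(os.path.dirname(epitran.__file__), 'data', 'map', 'fra-Latn-np.csv')

with open(path, encoding='utf-8') as f:
    text = f.read()

# Fix the misspelled header cell only if it is actually present.
if text.startswith('Prth'):
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Orth' + text[len('Prth'):])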

Here is my code:

import sys
from google_trans_new import google_translator
import epitran

translator = google_translator()
epi = epitran.Epitran('fra-Latn-np')

# Translate the first system argument
#translated_text = translator.translate(sys.argv[1], lang_src='en', lang_tgt='fr')
# Get the IPA pronunciation
#ipa_symbols = epi.transliterate(translated_text)

#print(translated_text)
#print(ipa_symbols)
print(epi.transliterate(sys.argv[1]))

Bad Cyrillic symbols sometimes

So I use this code:

from epitran.backoff import Backoff

backoff = Backoff(['fas-Arab', 'rus-Cyrl'])
                  
backoff.transliterate('Привет дорогой друг пидор')

and it gives

'prʲivʲet doroɡoй druɡ pʲidor'

As you see, the Russian й remains in the result, where it should (maybe) be j. Or am I wrong?
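If so, a hedged stopgap until the mapping handles it in this Backoff configuration (this just patches the output, it is not a fix in Epitran itself):

from epitran.backoff import Backoff

backoff = Backoff(['fas-Arab', 'rus-Cyrl'])

# Rewrite the stray Cyrillic й to the IPA approximant /j/.
ipa = backoff.transliterate('Привет дорогой друг').replace('й', 'j')
print(ipa)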

Function xsampa_list() in _epitran.py deletes a lot of segments

For instance, in Cebuano:

felix --> [e, l, i]
x --> []

In Swedish:

och --> []

I fixed this (I think) by simply replacing the commented line below with the uncommented one. Maybe this is horribly wrong, but it seems to work now.

# ipa_segs = self.ft.ipa_segs(self.epi.strict_trans(word, normpunc, ligaturize))
ipa_segs = self.ft.segs_safe(self.epi.transliterate(word, normpunc, ligaturize))
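For what it's worth, that swap looks plausible given how panphon's two segmenters differ. A hedged illustration; the exact behavior on non-IPA characters is my understanding of panphon, so double-check against your installed version:

import panphon

ft = panphon.FeatureTable()
print(ft.ipa_segs('o4'))   # unparseable characters are silently dropped
print(ft.segs_safe('o4'))  # unparseable characters are kept as segments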

Support for other English varieties

Hello,

If I understand correctly, if you use Flite as the backend for English G2P, you get transcriptions in US English. How would one go about getting transcriptions for other varieties of English, e.g. Received Pronunciation or Australian English? I know that Festvox supports British and Scottish English, so could it in theory be used as the backend for English G2P?

For my use case, it's not super important that the vowels are precise, but the rhoticity distinction would be extremely useful.

Thanks!

English IPA translation

I am trying to use Epitran to create IPA conversions for English sentences, and it doesn't produce the results I expect for some common words.

import epitran
epi = epitran.Epitran('eng-Latn')
epi.transliterate("was does buzz")
'wɑz dowz bʌz'

Note that the IPA for "does" contains a w. Looking through dictionaries, I find ˈdəz and dɪz. When all three are put into a simple IPA reader, the dictionary versions sound correct and Epitran's translation sounds wrong.

Can't pip install on Python 3

On Debian stretch:

$ pip3 install epitran
Collecting epitran
  Using cached epitran-0.23-py2.py3-none-any.whl
Collecting marisa-trie (from epitran)
  Using cached marisa_trie-0.7.4-cp35-cp35m-manylinux1_x86_64.whl
Collecting panphon>=0.12 (from epitran)
  Using cached panphon-0.12-py2.py3-none-any.whl
Collecting unicodecsv (from epitran)
Collecting subprocess32 (from epitran)
  Using cached subprocess32-3.2.7.tar.gz
    Complete output from command python setup.py egg_info:
    This backport is for Python 2.x only.
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-7hr0xfki/subprocess32/
