
Gᵢ2Pᵢ

Grapheme-to-Phoneme transformations that preserve input and output indices!

This library is for handling arbitrary conversions between input and output segments while preserving indices.
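What "preserving indices" means can be shown with a toy sketch (invented rules; the real library exposes richer alignment information on its transduction results): each output character remembers which input character produced it.

```python
def convert_with_indices(text, rules):
    """Apply single-character rules, recording (input index, output index) pairs."""
    output = ""
    edges = []  # alignment between input and output positions
    for i, ch in enumerate(text):
        sub = rules.get(ch, ch)
        for j in range(len(sub)):
            edges.append((i, len(output) + j))
        output += sub
    return output, edges

out, edges = convert_with_indices("hej", {"h": "HH ", "e": "EH ", "j": "Y"})
# out == 'HH EH Y'; edges records, e.g., that input 2 ('j') produced output 6 ('Y')
```

This is only a conceptual illustration; g2p handles multi-character rules, contexts, and chained mappings while keeping this kind of alignment intact.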


Background

The initial version of this package was developed by Patrick Littell to allow for g2p conversion from community orthographies to IPA and back again in ReadAlong-Studio. We then decided to pull the g2p mechanism out of Convertextract, which allows transducer relations to be declared in CSV files, and turn it into its own library - here it is! For an in-depth series on the motivation behind this tool and how to use it, have a look at this 7-part series on the Mother Tongues Blog, or for a more technical overview, have a look at this paper.

Install

The easiest way to install is with pip: pip install g2p. This command will install the latest release published on PyPI (see g2p releases).

You can also use hatch (see hatch installation instructions) to set up an isolated local development environment, which may be useful if you wish to contribute new mappings:

$ git clone https://github.com/roedoejet/g2p.git
$ cd g2p
$ hatch shell

You can also simply install an "editable" version with pip (but it is recommended to do this in a virtual environment or a conda environment):

$ git clone https://github.com/roedoejet/g2p.git
$ cd g2p
$ pip install -e .

Usage

The easiest way to create a transducer is to use the g2p.make_g2p function.

To use it, first import the function:

from g2p import make_g2p

Then, call it with an argument for in_lang and out_lang. Both must be strings equal to the name of a particular mapping.

>>> transducer = make_g2p('dan', 'eng-arpabet')
>>> transducer('hej').output_string
'HH EH Y'

There must be a valid path between the in_lang and out_lang in order for this to work. If you've edited a mapping or added a custom mapping, you must update g2p to include it: g2p update
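Conceptually, make_g2p searches the network of declared mappings for such a path and composes the mappings along it. A minimal sketch of that lookup, assuming a hand-written EDGES dict in place of the real network (which g2p builds from the installed mapping files):

```python
from collections import deque

# Hypothetical fragment of the mapping network: each edge is one declared mapping.
EDGES = {
    "dan": ["dan-ipa"],
    "dan-ipa": ["eng-ipa"],
    "eng-ipa": ["eng-arpabet"],
}

def find_path(in_lang, out_lang):
    """Breadth-first search for a chain of mappings from in_lang to out_lang."""
    queue = deque([[in_lang]])
    seen = {in_lang}
    while queue:
        path = queue.popleft()
        if path[-1] == out_lang:
            return path
        for nxt in EDGES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no valid path: make_g2p would fail here

print(find_path("dan", "eng-arpabet"))
# ['dan', 'dan-ipa', 'eng-ipa', 'eng-arpabet']
```

If no such chain exists between the two names, there is no valid path and the conversion cannot be built.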

Writing mapping files

Mapping files are written as either CSV or JSON files.

CSV

CSV files write each rule on its own line, using at least two and up to four columns. The first column is required and corresponds to the rule's input. The second column is also required and corresponds to the rule's output. The third column is optional and corresponds to the context before the rule input. The fourth column is also optional and corresponds to the context after the rule input. For example:

  1. This mapping describes two rules: a -> b and c -> d.
a,b
c,d
  2. This mapping describes two rules: a -> b / c _ d¹ and a -> e.
a,b,c,d
a,e
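Rows in this format can be parsed with the standard csv module. A minimal sketch (illustrative only, not g2p's actual loader), with field names mirroring the JSON keys used in the next section:

```python
import csv
import io

def load_rules(csv_text):
    """Parse mapping rows: in, out, and optional context_before/context_after."""
    fields = ["in", "out", "context_before", "context_after"]
    rules = []
    for row in csv.reader(io.StringIO(csv_text)):
        # zip() stops at the shorter sequence, so missing optional columns
        # simply do not appear in the rule dict.
        rules.append(dict(zip(fields, row)))
    return rules

rules = load_rules("a,b,c,d\na,e")
# [{'in': 'a', 'out': 'b', 'context_before': 'c', 'context_after': 'd'},
#  {'in': 'a', 'out': 'e'}]
```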

The g2p studio exports its rules to CSV format.

JSON

JSON files are written as an array of objects where each object corresponds to a new rule. The following two examples illustrate how the examples from the CSV section above would be written in JSON:

  1. This mapping describes two rules: a -> b and c -> d.
 [
   {
     "in": "a",
     "out": "b"
   },
   {
     "in": "c",
     "out": "d"
   }
 ]
  2. This mapping describes two rules: a -> b / c _ d¹ and a -> e.
 [
   {
     "in": "a",
     "out": "b",
     "context_before": "c",
     "context_after": "d"
   },
   {
     "in": "a",
     "out": "e"
   }
 ]
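A rule with contexts behaves much like a regular-expression substitution with lookbehind and lookahead. A rough sketch of that idea (not g2p's implementation, which additionally tracks indices and rule interactions):

```python
import re

def apply_rule(text, rule):
    """Apply one rewrite rule, honouring optional left/right context."""
    pattern = re.escape(rule["in"])
    if rule.get("context_before"):
        pattern = "(?<=" + re.escape(rule["context_before"]) + ")" + pattern
    if rule.get("context_after"):
        pattern = pattern + "(?=" + re.escape(rule["context_after"]) + ")"
    return re.sub(pattern, rule["out"], text)

rules = [
    {"in": "a", "out": "b", "context_before": "c", "context_after": "d"},
    {"in": "a", "out": "e"},
]

text = "cad"
for rule in rules:
    text = apply_rule(text, rule)
print(text)  # 'cbd': the contextual rule fires first, so a -> e never applies
```

On input "a" instead of "cad", the contextual rule would not match and the second rule would produce "e".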

Python

You can also write your rules programmatically in Python. For example:

from g2p.mappings import Mapping, Rule
from g2p.transducer import Transducer

mapping = Mapping(rules=[
    Rule(rule_input="a", rule_output="b", context_before="c", context_after="d"),
    Rule(rule_input="a", rule_output="e")
  ])

transducer = Transducer(mapping)
transducer('cad').output_string  # 'cbd'

CLI

update

If you edit or add new mappings to the g2p.mappings.langs folder, you need to update g2p by running g2p update.

convert

If you want to convert a string on the command line, you can use g2p convert <input_text> <in_lang> <out_lang>

Ex. g2p convert hej dan eng-arpabet would produce HH EH Y

If you have written your own mapping that is not included in the standard g2p library, you can point to its configuration file using the --config flag, as in g2p convert <input_text> <in_lang> <out_lang> --config path/to/config.yml. This will add the mappings defined in your configuration to the existing g2p network, so be careful to avoid namespace errors.

generate-mapping

If your language has a mapping to IPA and you want to generate a mapping between that and the English IPA mapping, you can use g2p generate-mapping <in_lang> --ipa. Remember to run g2p update before so that it has the latest mappings for your language.

Ex. g2p generate-mapping dan --ipa will produce a mapping from dan-ipa to eng-ipa. You must also run g2p update afterwards to update g2p. The resulting mapping will be added to the folder in g2p.mappings.langs.generated

Note: if your language goes through an intermediate representation, e.g., lang -> lang-equiv -> lang-ipa, specify both the <in_lang> and <out_lang> of your final IPA mapping to g2p generate-mapping. E.g., to generate crl-ipa -> eng-ipa, you would run g2p generate-mapping --ipa crl-equiv crl-ipa.

g2p workflow diagram

The interactions between g2p update and g2p generate-mapping are not fully intuitive, so this diagram should help clarify what's going on:

Text DB: this is the textual database of g2p conversion rules created by contributors. It consists of these files:

  • g2p/mappings/langs/*/*.csv
  • g2p/mappings/langs/*/*.json
  • g2p/mappings/langs/*/*.yaml

Gen DB: this is the part of the textual database that is generated when running the g2p generate-mapping command:

  • g2p/mappings/generated/*

Compiled DB: this contains the same info as Text DB + Gen DB, but in a format optimized for fast reading by the machine. This is what any program using g2p reads: g2p convert, readalongs align, convertextract, and also g2p generate-mapping. It consists of these files:

  • g2p/mappings/langs/langs.json.gz
  • g2p/mappings/langs/network.json.gz
  • g2p/static/languages-network.json

So, when you write a new g2p mapping for a language, say lll, and you want to be able to convert text from lll to eng-ipa or eng-arpabet, you need to do the following:

  1. Write the mapping from lll to lll-ipa in g2p/mappings/langs/lll/. You've just updated Text DB.
  2. Run g2p update to regenerate Compiled DB from the current Text DB and Gen DB, i.e., to incorporate your new mapping rules.
  3. Run g2p generate-mapping --ipa lll to generate g2p/mappings/langs/generated/lll-ipa_to_eng-ipa.json. This is not based on what you wrote directly, but rather on what's in Compiled DB.
  4. Run g2p update again. g2p generate-mapping updates Gen DB only, so what gets written there will only be reflected in Compiled DB when you run g2p update once more.

Once you have the Compiled DB, it is then possible to use the g2p convert command, create time-aligned audiobooks with readalongs align, or convert files with the convertextract library.
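The workflow above boils down to composing mappings. A toy sketch with invented lll rules, showing that converting lll all the way to eng-ipa is just the lll -> lll-ipa conversion followed by the lll-ipa -> eng-ipa one:

```python
# Hypothetical rule tables standing in for two compiled mappings.
lll_to_ipa = {"kw": "kʷ", "a": "ɑ"}
ipa_to_eng_ipa = {"kʷ": "kw", "ɑ": "ɑ"}

def convert(text, mapping):
    """Greedy longest-match conversion with a plain dict of rules."""
    out, i = "", 0
    keys = sorted(mapping, key=len, reverse=True)  # try longer inputs first
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out += mapping[k]
                i += len(k)
                break
        else:  # no rule matched: pass the character through unchanged
            out += text[i]
            i += 1
    return out

def lll_to_eng_ipa(text):
    # Chaining two mappings is just function composition.
    return convert(convert(text, lll_to_ipa), ipa_to_eng_ipa)
```

This is only a conceptual picture: the compiled DB stores the real rules, and g2p's transducers also preserve indices across each step of the chain.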

Studio

You can also run the g2p Studio, a web interface for creating custom lookup tables to be used with g2p. To run the g2p Studio, either visit https://g2p-studio.herokuapp.com/ or run it locally with python run_studio.py.

API for Developers

There is also a REST API available for use in your own applications. To launch it from the command line, use python run_studio.py or uvicorn g2p.app:APP. The API documentation will be viewable (with the ability to use it interactively) at http://localhost:5000/api/v1/docs; an OpenAPI definition is also available at http://localhost:5000/api/v1/openapi.json.

Maintainers

@roedoejet, @joanise.

Contributing

Feel free to dive in! Open an issue or submit PRs.

This repo follows the Contributor Covenant Code of Conduct.

Have a look at Contributing.md for help using our standardized formatting conventions and pre-commit hooks.

Adding a new mapping

In order to add a new mapping, follow these steps.

  1. Determine your language's ISO 639-3 code.
  2. Add a folder with your language's ISO 639-3 code to g2p/mappings/langs
  3. Add a configuration file at g2p/mappings/langs/<yourlangISOcode>/config-g2p.yaml. Here is the basic template for a configuration:
<<: &shared
  language_name: <This is the actual name of the language>
mappings:
  - display_name: This is a description of the mapping
    in_lang: This is your language's ISO 639-3 code
    out_lang: This is the output of the mapping
    type: mapping
    authors:
      - <YourNameHere>
    rules_path: <FilenameOfMapping>
    <<: *shared
  4. Add a mapping file. Look at the other mappings for examples, or visit the g2p studio to practise your mappings. Mappings are defined in either a CSV or JSON file. See writing mapping files for more info.
  5. Start a development shell with hatch shell (or install an editable version with pip install -e .), then update with g2p update.
  6. Add some tests in g2p/tests/public/data/<YourIsoCode>.psv. Each line in the file will run a test with the following structure: <in_lang>|<out_lang>|<input_string>|<expected_output>
  7. Run python3 run_tests.py langs to make sure your tests pass.
  8. Make sure you have checked all the boxes and make a pull request (https://github.com/roedoejet/g2p/pulls)!
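Each .psv test line is just four pipe-separated fields. A minimal sketch of splitting one (the real harness is invoked via run_tests.py):

```python
def parse_test_line(line):
    """Split one .psv test line into its four fields."""
    in_lang, out_lang, input_string, expected = line.rstrip("\n").split("|")
    return in_lang, out_lang, input_string, expected

fields = parse_test_line("dan|eng-arpabet|hej|HH EH Y")
# ('dan', 'eng-arpabet', 'hej', 'HH EH Y')
```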

Adding a new language for support with ReadAlongs

This repo is used extensively by ReadAlongs. In order to make your language supported by ReadAlongs, you must add a mapping from your language's orthography to IPA. So, for example, to add Danish (ISO 639-3: dan), the steps above must be followed. The in_lang for the mapping must be dan and the out_lang must be suffixed with 'ipa' as in dan-ipa. The following is the proper configuration:

mappings:
  - display_name: Danish to IPA
    language_name: Danish
    in_lang: dan
    out_lang: dan-ipa
    type: mapping
    authors:
      - Aidan Pine
    rules_path: dan_to_ipa.csv
    abbreviations_path: dan_abbs.csv
    rule_ordering: as-written
    case_sensitive: false
    norm_form: 'none'

Then, you can generate the mapping between dan-ipa and eng-ipa by running g2p generate-mapping dan --ipa. This will add the mapping to g2p/mappings/langs/generated - do not edit this file, but feel free to have a look. Then, run g2p update and submit a pull request, and tada! Your language is supported by ReadAlongs as well!

Footnotes

¹ If this notation is unfamiliar, have a look at phonological rewrite rules.

Contributors

This project exists thanks to all the people who contribute.

Citation

If you use this work in a project of yours and write about it, please cite us using the following:

Aidan Pine, Patrick Littell, Eric Joanis, David Huggins-Daines, Christopher Cox, Fineen Davis, Eddie Antonio Santos, Shankhalika Srikanth, Delasie Torkornoo, and Sabrina Yu. 2022. Gᵢ2Pᵢ Rule-based, index-preserving grapheme-to-phoneme transformations. In Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 52–60, Dublin, Ireland. Association for Computational Linguistics.

Or in BibTeX:

@inproceedings{pine-etal-2022-gi22pi,
    title = "{G}$_i$2{P}$_i$ Rule-based, index-preserving grapheme-to-phoneme transformations",
    author = "Pine, Aidan  and
      Littell, Patrick  and
      Joanis, Eric  and
      Huggins-Daines, David  and
      Cox, Christopher  and
      Davis, Fineen  and
      Antonio Santos, Eddie  and
      Srikanth, Shankhalika  and
      Torkornoo, Delasie  and
      Yu, Sabrina",
    booktitle = "Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.computel-1.7",
    pages = "52--60",
    abstract = "This paper describes the motivation and implementation details for a rule-based, index-preserving grapheme-to-phoneme engine {`}G$_i$2P$_i$' implemented in pure Python and released under the open source MIT license. The engine and interface have been designed to prioritize the developer experience of potential contributors without requiring a high level of programming knowledge. {`}G$_i$2P$_i$' already provides mappings for 30 (mostly Indigenous) languages, and the package is accompanied by a web-based interactive development environment, a RESTful API, and extensive documentation to encourage the addition of more mappings in the future. We also present three downstream applications of {`}G$_i$2P$_i$' and show results of a preliminary evaluation.",
}

License

MIT. See LICENSE for the Copyright and license statements.


g2p's Issues

g2p generate-mapping misbehaving: no default --out-dir + overwrites generated/config.yaml

First problem, g2p generate-mapping no longer finds its default out-dir:

$ g2p generate-mapping --ipa fra
INFO - Server initialized for eventlet.
Usage: g2p generate-mapping [OPTIONS] [alq|atj|ckt|clc-doulos|clc|crj|crl|crx-
                            sro|crx-syl|ctp|dan|eng-arpabet|fra|git|git-
                            apa|git-norm|hei-doulos|hei|hei-times|iku|iku-
                            sro|kwk-napa|kwk-umista|moh|moh-ascii|nav-
                            times|nav|oji|see|srs|str|tgx|und|win]
Try "g2p generate-mapping --help" for help.

Error: Invalid value for "--out-dir": Directory "" does not exist.

Second problem, when I add --out-dir, generated/config.yaml gets overwritten instead of appended to:

$ g2p generate-mapping --out-dir=g2p/mappings/langs/generated --ipa fra

generates generated/fra-ipa_to_eng-ipa.json as expected, but when I look at generated/config.yaml it lost all the other languages, it only contains French:

$ cat generated/config.yaml
mappings:
  - as_is: true
    authors:
      - Generated 2020-04-07 17:41:05.168894
    case_sensitive: true
    display_name: fra-ipa IPA to eng-ipa IPA
    escape_special: false
    in_lang: fra-ipa
    language_name: fra-ipa
    mapping: fra-ipa_to_eng-ipa.json
    norm_form: NFD
    out_lang: eng-ipa
    reverse: false

when it should have had a block for each of these languages:

$ git show master:g2p/mappings/langs/generated/config.yaml | grep display_name
    display_name: Atikamekw IPA to English IPA
    display_name: Danish IPA to English IPA
    display_name: French IPA to English IPA
    display_name: "SEN\u0106O\u0166EN IPA to English IPA"
    display_name: Algonquin IPA to English IPA
    display_name: Mohawk IPA to English IPA
    display_name: see-ipa IPA to eng-ipa IPA

define mechanism for tokenizer to recognize punctuation as letter conditionally

In PR #82, there's a hack to recognize the dot (.) as a letter when not word final in Tlingit (tli).

We need a more general mechanism to specify that some punctuation is a letter in some contexts for a given language, possibly in its config.json configuration.

E.g.,

  • tli: . is a letter when not word final
  • many languages have ' as a letter, possibly restricted to after a consonant or when in the middle of a word

Allow OOVs through API parameter

It should probably be a parameter in the API, like for orthography conversion you usually want OOVs to pass through, whereas for G2P you may want an error flag of some sort to tell the caller there's something strange about the input, so that it can try some other way of getting a pronunciation for that token.

Yes, that makes sense. I guess we'll just add a custom exception for that? Any other way you want it handled? What should we call the API parameter? strict something or other?

This relates to another thing: does gi-to-pi know the input and output vocabularies of each mapping, so as to be able to identify when improper inputs are submitted or improper outputs are generated?

No, it's currently not required to provide an inventory. I think we could do this, but it would be nice to not have to do it in the mapping files themselves, just so we don't have unnecessary rules (ie x -> x) making the mappings large. I can see adding an inventory key to each mapping in the config though and then pointing that to a separate csv or json or something also in the folder. Then we could also define normalization mappings that could handle basic normalization for OOV characters.

Originally posted by @roedoejet in #29 (comment)

g2p convert --config cannot read config file with just one mapping

First observed by running:

$ g2p convert --config public/mappings/minimal_config.yaml foo min out
INFO - Server initialized for eventlet.
Traceback (most recent call last):
  File "C:\Users\joanise\RAS\ras-env\Scripts\g2p-script.py", line 11, in <module>
    load_entry_point('g2p', 'console_scripts', 'g2p')()
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 596, in main
    return super().main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 440, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\joanise\ras\g2p\g2p\cli.py", line 293, in convert
    MAPPINGS_AVAILABLE.extend(data["mappings"])
KeyError: 'mappings'

I think the issue is line 285, which reads data = load_mapping_from_path(config) but should probably read something like data["mappings"] = [load_mapping_from_path(config)] because that structure is assumed on the subsequent line MAPPINGS_AVAILABLE.extend(data["mappings"]).

However, this suggested code change is not enough, because making it I now get this result:

$ g2p convert --config public/mappings/minimal_config.yaml foo min out
INFO - Server initialized for eventlet.
Usage: g2p convert [OPTIONS] INPUT_TEXT IN_LANG OUT_LANG
Try 'g2p convert -h' for help.

Error: 'min' is not a valid value for 'IN_LANG'

Write documentation

We need to write documentation to make this usable. There are also some standards that we need to include:

For example, folder names in /g2p/mappings/langs should use languages' ISO639-2 Terminology codes (as opposed to Bibliography codes). This is not at all obvious without documentation.

Consider changing configuration names

Certain config file settings are not very clear. For example, as_is: if true, rules are applied in the order they are declared, but if set to false, they are reverse-sorted by the length of their inputs. Instead we could consider:

rule_ordering: as-is

vs.

rule_ordering: sorted

Or something like that.

command line errors should give helpful error messages

Currently, if I type g2p convert frygt dan, forgetting OUT_LANG, or g2p convert frygt, leaving both IN_LANG and OUT_LANG out, I get this exception:

$ g2p convert frygt dan
INFO - Server initialized for eventlet.
Traceback (most recent call last):
  File "C:\Users\joanise\RAS\ras-env\Scripts\g2p-script.py", line 11, in <module>
    load_entry_point('g2p', 'console_scripts', 'g2p')()
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 557, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 717, in main
    rv = self.invoke(ctx)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\flask\cli.py", line 412, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "c:\users\joanise\ras\ras-env\lib\site-packages\click\core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "c:\users\joanise\ras\g2p\g2p\cli.py", line 76, in convert
    tg = transducer(input_text)
UnboundLocalError: local variable 'transducer' referenced before assignment

Instead, I should get an error message that says IN_LANG and OUT_LANG are missing, similar to the message produced when I run g2p convert with no arguments at all:

$ g2p convert
INFO - Server initialized for eventlet.
Usage: g2p convert [OPTIONS] INPUT_TEXT [IN_LANG] [OUT_LANG]
Try "g2p convert --help" for help.

Error: Missing argument "INPUT_TEXT".

Indices ordering not preserved

When calling a Transducer object initialized with a Correspondence containing "l{1}\u0313{2},ʔ{2}l{1}", the actual output returned is 'lʔ' but the expected output is 'ʔl'

cors = Correspondence([{"from": "l{1}\u0313{2}", "to": "ʔ{2}l{1}"}])
transducer = Transducer(cors)
transducer('l\u0313')

returns 'lʔ'

If the indices preserved ordering then this would not occur.

This is running in python3 on Windows 10.

disjunction with indices doesn't work

        sanity_mapping = Mapping([{"in": "a{1}", "out": "c{1}"}])
        sanity_transducer = Transducer(sanity_mapping)
        self.assertEqual(sanity_transducer('a').output_string, 'c') # Passes
        mapping = Mapping([{"in": "a{1}|b{1}", "out": "c{1}"}])
        transducer = Transducer(mapping)
        self.assertEqual(transducer('a').output_string, 'c') # Fails
        self.assertEqual(transducer('a').output_string, transducer('b').output_string) # Fails

win does not handle : correctly

  1. The win g2p outputs ascii : instead of \u02D0, which is the correct IPA symbol for length markers.

  2. The win-ipa to eng-ipa mapping sometimes doubles them up. All the lengthened vowels need to be reviewed.

e.g.:

$ g2p convert "ō" win eng-arpabet
OW ː
$ g2p convert "ee" win eng-arpabet
EY ː
  3. There is no public/data/win.* test suite to validate that g2p works correctly on this language.

Make messages for user errors friendlier in `g2p convert` CLI

g2p convert text foo bar outputs an unnecessary Traceback after outputting an informative error message. Ditto g2p convert text crl foo.

g2p convert text crl crl outputs a Traceback instead of just saying that's a noop.

g2p convert text crl fra outputs two unnecessary Tracebacks after outputting an informative error message.

All of these could follow the example of g2p convert text frl, which outputs an error message and tells the user to call g2p convert -h for help.

Support TSV files

By the way, just so you know: tab-separated did not fly. The .csv files have to have commas as the separator, or you get a cryptic exception.

We should support TSV files, but also add meaningful exception handling.

context_after="\s|$" does not always work correctly for end of word

In some mappings, we use context_after = \s|$ to do some processing on the end of a word.

Examples:

French:

  • fra_to_ipa.csv had rules like `` to delete the silent word-final "s" (changed to \b on 2021-11-01)
  • g2p convert "tests, tests tests" outputs tʌsts, tʌst tʌst, showing that before a space, and string final, it works, but not before a comma.

Mi'kmaq:

  • mic_to_ipa.json uses $ to match word-final.
  • g2p convert "tt" mic mic-ipa outputs tət
  • g2p convert "tt, tt tt" mic mic-ipa outputs ətt, tt tət

Several other mappings use $ one way or another.

Not sure what the best solution is. \b is also not always right (e.g., it's incompatible with prevent_feeding). It fixes French, in any case.
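The behaviour is easy to reproduce with a plain regular expression, assuming context_after is compiled into a lookahead (a simplification of what g2p actually does):

```python
import re

# Delete word-final "s" using a lookahead for "\s|$", as some mappings did.
rule = re.compile(r"s(?=\s|$)")

print(rule.sub("", "tests, tests tests"))
# 'tests, test test' -- the "s" before the comma survives, since ","
# is neither whitespace nor end of string.

# \b also catches the comma case (though, per the note above, it has
# its own drawbacks):
print(re.sub(r"s\b", "", "tests, tests tests"))
# 'test, test test'
```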

Documentation on configuration options

Setting things like as_is and prevent_feeding properly can be the difference between rules working, and not working. There have been a lot of internal discussions about these things, but we need to document this, along with use cases for when to use each option, and what the defaults are.

g2p doctor issues deprecated warning

g2p doctor -m fra outputs this warning message:

/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/panphon/distance.py:53: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dogol_prime = yaml.load(f.read())

Investigate and fix, presumably by adding a Loader as indicated in the message, though making sure the fixed code also works with Python 3.6.

AttributeError: 'OutStream' object has no attribute 'buffer'

from g2p import make_g2p
Traceback (most recent call last):

  File "<ipython-input-1-7add92db992e>", line 1, in <module>
    from g2p import make_g2p

  File "C:\ProgramData\Anaconda3\lib\site-packages\g2p\__init__.py", line 18, in <module>
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf8")

AttributeError: 'OutStream' object has no attribute 'buffer'

Last version of g2p and python 3.7.7

Construction interfaces for Mappings

I'm attempting to create a transducer (in this case, Inuktitut SRO to IPA) but the method in the README doesn't seem to work. I checked the source and the Mapping constructor does not appear to do anything with a "language" kwarg.

The interface in which you go mapping="" initially appears to work [1], but there's no means to specify the mapping index through this interface, and in this case it's the third.

The interface in which you specify an in_lang and out_lang loads the mapping, but then the transducer I made from it just leaves everything intact.

[1] Actually, first it needed a fix where we specify the encoding kwarg of open() to "utf-8", since that's a platform-dependent default rather than a cross-platform default. Python 3.6 and 3.7 on Windows lag behind for some reason on updating open() to a UTF default.
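The encoding pitfall in [1] can be illustrated with the stdlib alone: passing an explicit encoding="utf-8" makes file I/O round-trip IPA text on every platform, whereas the default encoding is platform-dependent (often cp1252 on Windows):

```python
import os
import tempfile

text = "Inuktitut IPA: ɑʔ"

# Write and read back with an explicit encoding; this round-trips everywhere.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    assert f.read() == text
os.remove(path)
```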

Flask 2.0.1 incompatibility with g2p

Well, I guess we should have said Flask==2.0.0, not Flask>=2.0.0.
Turns out we're not compatible with 2.0.1 - see https://travis-ci.com/github/roedoejet/g2p/builds/227010018

A possible patch is to change g2p/api.py line 128, where the problem occurs, to read

g2p_api = Blueprint('resources_g2p', __name__)

instead of

g2p_api = Blueprint('resources.g2p', __name__)

The issue is that with 2.0.1, Blueprint raises ValueError if "." in name.

But I don't know what impact there could be from changing that resource name. The g2p unit tests still pass, but is there a potential impact on convertextract or something else?

English mapping has strange /g/ phoneme

I noticed that "g" was somehow falling through in conversions from oji to eng-arpabet, and couldn't figure out why, until I noticed that the "g" in eng_ipa_to_arpabet.json is not the standard ASCII "g" but actually U+0261 LATIN SMALL LETTER SCRIPT G.

Is there a good reason for this? Should we change it to good old fashioned U+0067 LATIN SMALL LETTER G or should we keep both of them?
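The confusion is easy to verify with the stdlib: the two glyphs are distinct code points, and Unicode normalization does not unify them:

```python
import unicodedata

ascii_g = "g"        # U+0067 LATIN SMALL LETTER G
script_g = "\u0261"  # U+0261 LATIN SMALL LETTER SCRIPT G (IPA voiced velar stop)

# Visually near-identical, but never equal, even after normalization:
assert ascii_g != script_g
assert unicodedata.normalize("NFC", script_g) != ascii_g
print(unicodedata.name(script_g))  # 'LATIN SMALL LETTER SCRIPT G'
```

So any mapping that expects one of the two must either use it consistently or include a rule mapping one to the other.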

g2p is not currently compatible with Python 3.10 - because of eventlet

Summary: it appears that neither eventlet 0.30.2 nor the latest version is compatible with Python 3.10, or possibly with some of the library versions we require in g2p.

setup:

conda create -p ./conda-venv-3.10-ras python==3.10.0
conda activate ./conda-venv-3.10-ras
conda install ffmpeg
pip install g2p

command that errors out (simpler yet: python -c "import eventlet" shows the same error):

$ g2p -h
Traceback (most recent call last):
  File "/home/joanise/conda-venv-3.10-ras/bin/g2p", line 5, in <module>
    from g2p.cli import cli
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/g2p/cli.py", line 13, in <module>
    from g2p.app import APP, network_to_echart
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/g2p/app.py", line 13, in <module>
    from flask_socketio import SocketIO, emit
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/flask_socketio/__init__.py", line 9, in <module>
    from socketio import socketio_manage  # noqa: F401
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/socketio/__init__.py", line 9, in <module>
    from .zmq_manager import ZmqManager
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/socketio/zmq_manager.py", line 5, in <module>
    import eventlet.green.zmq as zmq
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/__init__.py", line 17, in <module>
    from eventlet import convenience
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/convenience.py", line 7, in <module>
    from eventlet.green import socket
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/green/socket.py", line 4, in <module>
    __import__('eventlet.green._socket_nodns')
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/green/_socket_nodns.py", line 11, in <module>
    from eventlet import greenio
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/greenio/__init__.py", line 3, in <module>
    from eventlet.greenio.base import *  # noqa
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/greenio/base.py", line 32, in <module>
    socket_timeout = eventlet.timeout.wrap_is_timeout(socket.timeout)
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/timeout.py", line 166, in wrap_is_timeout
    base.is_timeout = property(lambda _: True)
TypeError: cannot set 'is_timeout' attribute of immutable type 'TimeoutError'

Attempt at a quick fix:

$ pip install --upgrade eventlet
$ g2p -h
Traceback (most recent call last):
  File "/home/joanise/conda-venv-3.10-ras/bin/g2p", line 5, in <module>
    from g2p.cli import cli
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/g2p/cli.py", line 13, in <module>
    from g2p.app import APP, network_to_echart
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/g2p/app.py", line 13, in <module>
    from flask_socketio import SocketIO, emit
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/flask_socketio/__init__.py", line 9, in <module>
    from socketio import socketio_manage  # noqa: F401
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/socketio/__init__.py", line 9, in <module>
    from .zmq_manager import ZmqManager
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/socketio/zmq_manager.py", line 5, in <module>
    import eventlet.green.zmq as zmq
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/__init__.py", line 17, in <module>
    from eventlet import convenience
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/convenience.py", line 7, in <module>
    from eventlet.green import socket
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/green/socket.py", line 21, in <module>
    from eventlet.support import greendns
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/support/greendns.py", line 66, in <module>
    setattr(dns, pkg, import_patched('dns.' + pkg))
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/support/greendns.py", line 61, in import_patched
    return patcher.import_patched(module_name, **modules)
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/patcher.py", line 132, in import_patched
    return inject(
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/eventlet/patcher.py", line 109, in inject
    module = __import__(module_name, {}, {}, module_name.split('.')[:-1])
  File "/home/joanise/conda-venv-3.10-ras/lib/python3.10/site-packages/dns/namedict.py", line 35, in <module>
    class NameDict(collections.MutableMapping):
AttributeError: module 'collections' has no attribute 'MutableMapping'

LANGS_AVAILABLE is not complete

This is an issue at the intersection of g2p and ReadAlongs/Studio: the variable LANGS_AVAILABLE in g2p.mappings.langs does not include all languages available for mapping.

Currently, the list, obtained by calling readalongs align -h or by giving an invalid language code to the -l option, is:

alq, atj, ckt, crj, crk, crl, crm, csw, ctp, dan, fra, git, gla, iku, kkz, lml, moh, oji, see, srs, str, tce, tgx, tli, und, win, eng

but the full list, if we ignore the *-ipa instances, is:

alq, atj, ckt, crg-tmd, crg-dv, crj, crj-norm, crk-no-symbols, crk, crl, crl-norm, crm, crm-norm, csw, csw-norm, ctp, dan, fra, git, git, git, gla, iku, kkz, kwk-napa, kwk-umista, kwk-umista, kwk-boas, lml, moh, moh, oji, oji-syl, see, srs, str, tce, tce-norm, tgx, tli, tli-norm, und, win, eng

Languages currently missing:

  • crg-tmd, crg-dv,
  • *-norm for * in [crj, crl, crm, csw, tce, tli]
  • kwk-napa, kwk-umista, kwk-umista, kwk-boas
  • crk-no-symbols
  • oji-syl

The *-norm and *-no-symbols probably don't belong in the list, since (I believe) they are intermediate representations, but the others need to be included since a user might have them as the input language to create a read-along.

Initial patch proposal, used to create the extended list above:

LANGS_AVAILABLE = [{mapping['in_lang']: mapping['language_name']} for k, v in LANGS.items() for mapping in v['mappings'] if not mapping['in_lang'].endswith("-ipa")]

compare with the current code:

LANGS_AVAILABLE = [{k: v['language_name']} for k, v in LANGS.items() if k not in ['generated', 'font-encodings']]

"KeyError: 5" when doing g2p on French branch

When running

g2p convert "manger" fra eng-ipa --debugger

in my dev branch, I get a "KeyError: 5" exception. See trace log:

Traceback (most recent call last):
  File "/home/joa125/u/anaconda3/envs/ilt/bin/g2p", line 11, in <module>
    load_entry_point('g2p', 'console_scripts', 'g2p')()
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/flask/cli.py", line 557, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/flask/cli.py", line 412, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/gpfs/fs1/nrc/ict/others/u/joa125/sandboxes/readalong/g2p-fra/g2p/cli.py", line 74, in convert
    PRINTER.pprint(transducer(input_text, index=index, debugger=debugger))
  File "/gpfs/fs1/nrc/ict/others/u/joa125/sandboxes/readalong/g2p-fra/g2p/transducer/__init__.py", line 475, in __call__
    return self.apply_rules(to_convert, index, debugger)
  File "/gpfs/fs1/nrc/ict/others/u/joa125/sandboxes/readalong/g2p-fra/g2p/transducer/__init__.py", line 482, in apply_rules
    response = transducer(converted, index, debugger)
  File "/gpfs/fs1/nrc/ict/others/u/joa125/sandboxes/readalong/g2p-fra/g2p/transducer/__init__.py", line 72, in __call__
    return self.apply_rules(to_convert, index, debugger)
  File "/gpfs/fs1/nrc/ict/others/u/joa125/sandboxes/readalong/g2p-fra/g2p/transducer/__init__.py", line 371, in apply_rules
    if indices[k]['input_string'] != new_index[k]['input_string'] and len(intermediate_to_convert) - 1 >= k and new_index[k]['input_string'] == intermediate_to_convert[k]:
KeyError: 5

Will push a dev branch to my fork shortly.

tli doesn't map ' to a proper IPA symbol

Discovered while moving the tokenizer to g2p:

tli does not map ' to a proper IPA symbol:

$ g2p convert  "k'w" tli tli-ipa
k'ʷ
$ g2p convert  "k'w" tli eng-arpabet
K 'W

Additional side effect: the tokenizer for tli will not recognize ' as being part of the word.

Question: is it correct to map it to the glottal stop, like in other west coast languages:

',ʔ

deleting a cell in gi-to-pi Studio causes exception

When using the Gⁱ-to-Pⁱ Studio, if you hit the Delete key or use Ctrl-X inside the Custom Rules spreadsheet, you get an exception.

Trace, after entering a, e and hitting Delete in the "context before" column, with fae as the text to map.

Notice the ["a","e",null,""] in the mappings, which causes the problem.

INFO - 276110fdfcbf46859a87a2bb5a24c5a0: Received packet MESSAGE data 2/convert,["conversion event",{"data":{"input_string":"fae","mappings":[["a","e",null,""],["","","",""],["","","",""],["","","",""],["","","",""],["","","",""],["","","",""],["","","",""],["","","",""],["","","",""]],"abbreviations":[["Vowels","a","e","i","o","u"],["","","","","",""],["","","","","",""],["","","","","",""],["","","","","",""],["","","","","",""],["","","","","",""],["","","","","",""],["","","","","",""],["","","","","",""],["","","","","",""]],"kwargs":{"as_is":true,"case_sensitive":true,"escape_special":false,"reverse":false}}}]
INFO - received event "conversion event" from 276110fdfcbf46859a87a2bb5a24c5a0 [/convert]
INFO - 142.98.46.127 - - [26/Sep/2019 15:18:00] "POST /socket.io/?EIO=3&transport=polling&t=Mrl0Zmn&sid=276110fdfcbf46859a87a2bb5a24c5a0 HTTP/1.1" 200 -
Exception in thread Thread-24:
Traceback (most recent call last):
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/socketio/server.py", line 648, in _handle_event_internal
    r = server._trigger_event(data[0], namespace, sid, *data[1:])
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/socketio/server.py", line 677, in _trigger_event
    return self.handlers[namespace][event](*args)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/flask_socketio/__init__.py", line 277, in _handler
    *args)
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/site-packages/flask_socketio/__init__.py", line 680, in _handle_event
    ret = handler(*args)
  File "/gpfs/fs1/nrc/ict/others/u/joa125/sandboxes/readalong/g2p/g2p/__init__.py", line 94, in convert
    message['data']['abbreviations']), **message['data']['kwargs'])
  File "/gpfs/fs1/nrc/ict/others/u/joa125/sandboxes/readalong/g2p/g2p/mappings/__init__.py", line 82, in __init__
    if key in ['in', 'out', 'context_before', 'context_after'] and re.search(abb_match, io[key]):
  File "/home/joa125/u/anaconda3/envs/ilt/lib/python3.7/re.py", line 183, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
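
The null cells could be neutralized before the rules reach the regex code. A minimal sketch (hypothetical helper, not the actual g2p code) that coerces None cells to empty strings and drops rows that define no input segment:

```python
def sanitize_mappings(rows):
    """Coerce None cells to "" and drop rows with an empty "in" column."""
    cleaned = []
    for row in rows:
        row = ["" if cell is None else cell for cell in row]
        if row[0]:  # keep only rows that actually define an input segment
            cleaned.append(row)
    return cleaned
```

Applied to the payload above, the ["a","e",null,""] row would become ["a","e","",""] and the all-empty filler rows would be discarded.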

Socket loses connection silently

Clicking on the output text area, or adding rows to the table stops the g2p from working and seemingly disconnects the socket, but the UI is not updated.

Log non-ipa characters for ipa mappings

There should be some logging or notification when a mapping that is suffixed with -ipa uses non-standard characters in the output. For example, ʦ instead of t͡s, or g (\u0047) instead of ɡ (\u0261).
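A simple warning pass could catch the most common lookalikes. This is only a sketch; the substitution table below is illustrative, not a complete inventory:

```python
# Illustrative table: common non-IPA characters and their IPA counterparts
LOOKALIKES = {"g": "\u0261", "ʦ": "t\u0361s", ":": "ː"}

def check_ipa_output(out_string):
    """Return (found, suggested) pairs for suspicious characters in a rule output."""
    return [(ch, LOOKALIKES[ch]) for ch in out_string if ch in LOOKALIKES]
```

A loader for -ipa mappings could call this on every rule's "out" field and log the suggestions.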

missing dependency: Travis (+ linux?) can't read xlsx files

Travis seems to be missing a dependency of some sort because when it parses xlsx files, it produces empty strings where it should produce other values. This does not happen locally on my Mac, but happens on @joanise's Linux machine.

{"in": "a", "out": "b", "context_before": "a", "context_after": "b"}
{"in": "", "out": "", "context_before": "", "context_after": ""}

Using . (dot) in context_before eats up that character

TLDR: if a rule has context_before = ".", the character before the "in" is incorrectly deleted.

This is clearly related to issue #15 but apparently I found another situation that trips that patch.

Long story:

In French, I want a rule that says delete word-final "s" as long as it's not also word initial, i.e., as long as "s" is not the whole word.

In branch dev.fra, commit 0233c15, I have this rule in fra/fra_to_ipa.csv that attempts to accomplish that:

s,,.,\s|$

It replaces my rule s,,,\s|$ from the same file in branch master, but attempts to avoid applying it to the word "s" itself.

However, when this rule is applied, it erases both the "s" and the character that precedes it:

$ g2p convert écoutons fra fra-ipa --debugger
[...]
    [   {   'end': 9,
            'input': 'écoutons',
            'output': 'écouto',
            'rule': {   'context_after': '\\s|$',
                        'context_before': '.',
                        'in': 's',
                        'out': ''},
            'start': 7},
[...]

As shown above, the output of applying my rule is "écouto" instead of "écouton" as intended.
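The behaviour is consistent with the context being compiled as a consuming pattern rather than a zero-width lookbehind. A plain-regex illustration (not the actual g2p internals):

```python
import re

word = "écoutons"

# If "." in context_before is consumed as part of the match,
# the preceding character is deleted along with the "s":
assert re.sub(r".s(\s|$)", r"\1", word) == "écouto"

# A zero-width lookbehind leaves the preceding character alone:
assert re.sub(r"(?<=.)s(?=\s|$)", "", word) == "écouton"
```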

Need to perform tokenization in g2p conversions

When ReadAlongs/Studio calls g2p, it does so on tokenized text, so that each word is passed as a single string, and ^ can match the beginning of the word, and $ the end of the word.

When g2p convert or convertextract are used, the input text (or maybe line) is passed as a whole, so that ^ and $ match the beginning and end of the line, respectively, instead of the beginning and end of each word.

Affected mappings: mic/mic_to_ipa.json and fra/fra_to_ipa.csv encode rules that are sensitive to the beginning or end of words, and only work correctly on single words. The same is true of git/Orthography.csv and git/Orthography_Deterministic.csv, but those are not in use, so they're not an issue.

Showing the problem:

$ g2p convert "sq" mic mic-ipa
səx
$ g2p convert "sq sq sq" mic mic-ipa
əsx sx səx

In the second command, the first word matches s in word-initial position, the third one matches q in word-final position, and the middle one matches neither. The correct output should have been səx səx səx.

A similar problem exists in French, where I tried to treat spaces around words as also marking the beginning and end of words, but with logic that fails when punctuation is present:

$ g2p convert "Ceci est un test test." fra fra-ipa
sʌsi ɛ œ̃ tɛ tʌst.

Although neither tɛ nor tʌst is great (tɛst would have been better), we would like the two to be mapped identically.

Possible solution:

In readalongs/text/tokenize_xml we have logic that tokenizes text along this rule:

  • a string is part of a token if it appears on the "in" side of any rule in its mapping to IPA
  • remaining characters are part of a token if they are Unicode types "letter", "number" or "diacritic"
  • everything else is not part of a token.

While this logic is necessary in readalongs, I think it could reasonably belong inside g2p, since it is tightly related to the g2p mappings.

Then, g2p convert, g2p scan, convertextract, etc, could all use the following algorithm, that readalongs align already effectively uses:

  • tokenize the input string into an alternating sequence of tokens and non-tokens
  • map each token in the sequence
  • print the mapped tokens and the unchanged non-tokens in their original order

The benefit would be that applying a g2p mapping in any context would always produce the same output.
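The steps above can be sketched in plain Python. The `convert` callable stands in for a g2p mapping applied to one word; the real tokenizer would also treat characters appearing on the "in" side of mapping rules as word characters:

```python
import re

def convert_by_token(text, convert):
    """Split text into alternating word/non-word chunks; convert only words."""
    pieces = []
    for chunk in re.findall(r"\w+|\W+", text):
        pieces.append(convert(chunk) if re.match(r"\w", chunk) else chunk)
    return "".join(pieces)
```

With this wrapper, spaces and punctuation pass through unchanged and every word is mapped in isolation, so ^ and $ in the rules always see word boundaries.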

Upper case characters not recognized by csv rules

Any uppercase characters are simply disregarded by the current csv rules. Any character at the beginning of a sentence in the Orthography line is not being transduced.

'N --> ʔN is being read as only '
'n --> n̓: the lower-case version produces an entirely different output

Unsure how to solve this, as the solution cannot account only for ASCII characters (Heiltsuk).

Morpheme-break non-voicing is not accounted for in csv rules

In order to achieve greater accuracy with voicing, the csv rules would need to account for morpheme breaks. Voicing does not occur when a morpheme-initial stop is preceded by a vowel, while all other incidences of stops preceded by vowels are voiced.

Current csv rule:
t,d,a|ʌ|æ|e|ɛ|ɪ|ɨ|i|ɔ|o|u

Outputs a voiced stop every time a vowel precedes a stop.

A rule that accounts for the morpheme break would be needed to block voicing.
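If morpheme breaks were marked in the input (say, with a hypothetical "-" separator), a zero-width context would block voicing across the break. A plain-regex sketch of the idea, not actual g2p rule syntax:

```python
import re

VOWELS = "aʌæeɛɪɨiɔou"

def voice_t(word):
    # Voice t only when a vowel is immediately adjacent;
    # a "-" morpheme-break marker between them blocks the rule.
    return re.sub(rf"(?<=[{VOWELS}])t", "d", word)
```

So "at" would voice to "ad", while "a-t" (morpheme-initial t after a vowel) would be left alone.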

CRL mapping needs normalization

Here's my description from the crl mapping readme in dev.crl. Can somebody help with this? I'm not sure which way to normalize.

AP: There seems to be a problem here with normalization. Most of the rules for long vowels are declared with \u1427 "CANADIAN SYLLABICS FINAL MIDDLE DOT", so ᐧᐋ is the sequence \u1427\u140B, but there also appears to be a specific code point for waa: \u1419. I've added a crl_norm.json that normalizes the sequence to the single code point for that character and changed the crl_to_ipa.json mapping to use \u1419 instead of \u1427\u140B, but I'm not sure this was the right choice. Either way, there needs to be some sort of normalization step here to handle real-world input.
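Whichever direction turns out to be right, the normalization step itself is just a sequence rewrite. A minimal sketch of what the crl_norm.json mapping does:

```python
def normalize_crl(text):
    # Rewrite U+1427 (CANADIAN SYLLABICS FINAL MIDDLE DOT) + U+140B
    # to the single code point U+1419, as described above.
    return text.replace("\u1427\u140B", "\u1419")
```

Note that Unicode NFC/NFD won't do this: the syllabics block has no canonical compositions for these sequences, so the mapping has to be spelled out explicitly.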

Can't find language "eng" when converting

Not sure what's going on here, since the log indicates that it has definitely found the languages.

Also, is there some good reason why the lists of mappings are now stored in binary pickle files instead of, say, just looking them up in the filesystem like we were doing before? It seems brittle and opaque to me.

INFO - Adding mapping between eng-ipa and eng-arpabet to composite transducer.
ERROR - No lang called eng. Please try again.
Traceback (most recent call last):
  File "/home/dhd/py/readalongs3.7/bin/readalongs", line 11, in <module>
    load_entry_point('readalongs', 'console_scripts', 'readalongs')()
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/flask/cli.py", line 557, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/flask/cli.py", line 412, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/dhd/work/ReadAlong-Studio/readalongs/cli.py", line 103, in align
    if kwargs['save_temps'] else None))
  File "/home/dhd/work/ReadAlong-Studio/readalongs/align.py", line 109, in align_audio
    xml = convert_xml(xml)
  File "/home/dhd/work/ReadAlong-Studio/readalongs/text/convert_xml.py", line 179, in convert_xml
    convert_words(xml_copy, word_unit, output_orthography)
  File "/home/dhd/work/ReadAlong-Studio/readalongs/text/convert_xml.py", line 126, in convert_words
    converter = make_g2p(unit['lang'], output_orthography)
  File "/home/dhd/py/readalongs3.7/lib/python3.7/site-packages/g2p/__init__.py", line 132, in make_g2p
    raise(FileNotFoundError)
FileNotFoundError

Danish g2p does not process "frygt" correctly

The "r" in "frygt" incorrectly stays as is in the eng-arpabet output.

$ g2p convert   frygt dan eng-arpabet
INFO - Server initialized for eventlet.
F rUW Y T

This causes errors in readalongs align with the Danish UDHR.

A similar error occurs with "undertrykkelse":

$ g2p convert undertrykkelse dan eng-arpabet
INFO - Server initialized for eventlet.
UW N D EH Y T rUW K K EH L S EH

Source for these two words: https://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=dns

The last five lines in fn-unicode-font mapping don't seem to work on Windows

g2p convert 'X   ล ɤ ∛ X' fn-unicode-font fn-unicode

where the input starts and ends with ASCII X, and has the literal character on the left-hand side of each of the last five rules in font-encodings/fn_unicode.csv in between.

Run on Windows, this command outputs this line, where each ? is really a literal \x3f in the output:

x ? ? ? ? ? x

Run on Linux, this command produces the expected output:

x ᶿ √ ḥ ɣ · x

Abbreviation names cannot be substrings

If I create a mapping with abbreviations, and use names that are substrings of each other, it breaks the interpretation.

To reproduce, modify mappings/langs/fra/fra_abbs.csv and rename AOU_VOW to AOU_VOWEL and IE_VOW to EI_VOWEL. Make the corresponding changes in fra_to_ipa.csv and run g2p update; tests/test_langs.py now fails.

Running g2p convert --debugger mangeons fra fra-ipa shows that AOU_VOWEL was expanded to AOU_{value of VOWEL} instead of {value of AOU_VOWEL}.
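One way to avoid the clash, illustrated with hypothetical abbreviation values, is to expand all names in a single pass, matching the longest names first:

```python
import re

abbs = {"VOWEL": "a|e|i|o|u", "AOU_VOWEL": "a|o|u"}  # hypothetical values

# Naive name-by-name replacement can expand the substring first:
naive = "AOU_VOWEL".replace("VOWEL", abbs["VOWEL"])   # "AOU_a|e|i|o|u"

# Matching all names in one pass, longest first, avoids the clash:
pattern = re.compile("|".join(sorted(abbs, key=len, reverse=True)))
safe = pattern.sub(lambda m: abbs[m.group(0)], "AOU_VOWEL")  # "a|o|u"
```

Requiring word boundaries around abbreviation names would be an alternative fix.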

Unintended rule interactions

Because the rules in g2p apply in sequence, and because mappings can be normalized to either NFC or NFD, some unintended bleeding and feeding interactions can happen. I think this is the root cause of some of the weird problems we were blaming on normalization before.

For example, in the ctp -> ctp-ipa mapping, there is a rule that transforms kw -> kʷ. There is also a rule that transforms k -> kʲ. If the first rule is ordered before the second, it will feed it and an input of kw will output kʲʷ. If the second rule is ordered before the first, it will bleed it and an input of kw will output kʲw.

The same problem can happen as a result of normalization. In ctp-ipa -> eng-ipa, there are rules õ -> õː and o -> oː, which cause the same issues as the k rules when normalized to NFD (but not with NFC).

Chris Cox solved this in his tgx mapping by creating an intermediate form like so:

    { "in": "",	"out": "1R" },
    { "in": "t",	"out": "" },
    { "in": "ʼ",	"out": "ʔ" },
    { "in": "1",	"out": "t" },
    { "in": "R",	"out": "ʼ" },

But is this how we should recommend solving this problem? Should we check for this type of interaction (and maybe insert the intermediate forms) by default? Or should we encourage a more regex-centric approach like changing k -> kʲ to k(?!ʷ) -> kʲ and add a new syntax into the mapping files to allow for negative lookaheads/lookbehinds?
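The three behaviours can be reproduced with plain string operations (illustration only, not g2p's rule engine):

```python
import re

word = "kw"

# kw -> kʷ ordered first feeds k -> kʲ:
fed = word.replace("kw", "kʷ").replace("k", "kʲ")        # "kʲʷ"

# k -> kʲ ordered first bleeds kw -> kʷ:
bled = word.replace("k", "kʲ").replace("kw", "kʷ")       # "kʲw"

# A negative lookahead makes the palatalization rule order-independent:
safe = re.sub(r"k(?!ʷ|w)", "kʲ", word.replace("kw", "kʷ"))  # "kʷ"
```

With the lookahead variant, a bare k still becomes kʲ (re.sub(r"k(?!ʷ|w)", "kʲ", "ka") gives "kʲa"), so the rules no longer interact regardless of order.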

make g2p generate-mapping more flexible

When creating a mapping from lang-ipa to eng-ipa using g2p generate-mapping --ipa lang, the software expects a mapping lang -> lang-ipa to exist, but it won't work if the mapping is called lang-norm -> lang-ipa instead.

As a work-around, one has to temporarily rename the existing mapping and then rename the generated one. E.g., lang-norm -> lang-norm-ipa will get lang-norm-ipa -> eng-ipa generated, and then lang-norm-ipa can be renamed back to lang-ipa in both languages.

Task 1: make generate-mapping more flexible, so that such inputs are accepted as is and handled correctly.

Task 2: in some cases, there might be more than one mapping into lang-ipa, e.g., crg-tmd -> crg-ipa and crg-dv -> crg-ipa. In this case, generate-mapping should work on the union of the two mappings to create one unified crg-ipa -> eng-ipa.

Replace lower-casing of expressions with exclusive use of RE case insensitive flag

As for case_sensitive by rule, I don't like it. \S is not a letter, it's a regex symbol, and when I say match "s" after "\S", I still want "s" to be case insensitive. I believe it is accurate to claim that you only need to protect one letter after each backslash; I'm not aware of any other use of letters in regexes that we would want to protect. Actually, why don't you use re's own case-insensitive flag instead of lower-casing the expression?

Originally posted by @joanise in #31 (comment)
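Lower-casing a whole expression is hazardous precisely because of symbols like \S: lower-casing flips it into \s, which matches the opposite character class, whereas re.IGNORECASE leaves the symbol intact:

```python
import re

pattern = r"X\S"

# Lower-casing corrupts \S (non-space) into \s (space):
assert re.search(pattern.lower(), "x s") is not None

# The IGNORECASE flag handles letter case without touching \S:
assert re.search(pattern, "x s", re.IGNORECASE) is None
assert re.search(pattern, "xs", re.IGNORECASE) is not None
```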

Empty "out" (e.g. for deletion) causes error

An empty string for the "out" in the conversion table (in this case, for a tone-marking letter with no segmental pronunciation) causes a warning (WARNING - Sorry, something went wrong. Try checking the two IOStates objects you're trying to compose) and later an error in alignment, as the word ends up missing from the output table.

regex character escape happening more than once

When a mapping config is set to escape special characters, it should only do this once! Otherwise silly things like this happen:

Escaped special characters in '\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\?' with '\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\?''. Set 'escape_special' to False in your Mapping configuration to disable this.
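The blow-up is just re.escape applied repeatedly; each pass doubles the backslashes:

```python
import re

once = re.escape("?")      # '\\?'
twice = re.escape(once)    # '\\\\\\?'
# Escaping must be applied exactly once, when the rule is first loaded,
# not every time the mapping configuration is re-processed.
```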

g2p doctor needs to process \u escape sequences

Problem:
g2p doctor -m see-ipa
complains about the six rules in see/see_to_ipa.csv that have \u0303 in the output string.

Work-around: rewrite ö:,o\u0303 as ö:,õ.

Correct solution: check_ipa_known_segs() in mappings/langs/utils.py needs to apply the \uNNNN escape sequences to rule['out'] before checking it.
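Python can apply the escapes with the unicode_escape codec. A one-line sketch of what check_ipa_known_segs() would need to do, assuming the rule output is plain ASCII plus \uNNNN sequences (non-ASCII output would need to be handled separately, since it can't round-trip through an ASCII encode):

```python
def apply_u_escapes(s):
    # Turn the literal six characters "\u0303" into the single
    # code point U+0303 before checking the segment.
    return s.encode("ascii").decode("unicode_escape")
```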

Certain rules eating characters!

It seems that when context_before includes a regex set, it eats that character.

So, a rule with a,b,c, and input ca is accurately producing cb, but a rule with a,b,[cd], and the same input is just producing b.
