minixc / phones

A collection of utilities for handling IPA phones.

Home Page: https://cdminix.me/phones

License: MIT License

Python 100.00%

phonetics phonetic speech-recognition speech-to-text speech-synthesis linguistics

phones's Introduction

phones

phones is a Python library for easy handling of phones in the International Phonetic Alphabet (IPA). IPA phones are useful because they can describe how words are pronounced in most languages.

Feature Overview

Extract numeric feature vectors from phones.
Map phones from one language to another by finding the closest phones.
Convert between ARPABET, X-SAMPA/SAMPA and IPA notation.
Compute phone distances.
Do phone arithmetic on a phone and phone-feature level.
Visualise phones and their distances when installing phones[plots].

Installation

For the core library:

pip install phones

For plotting:

pip install phones[plots]

phones's Issues

add dialect filter

As pointed out in #4, some languages in phoible are made up of a group of specific dialects.
We need to figure out how best to communicate this, and we have to add an option to phones for filtering by dialect.

I think it might be best to go with something like this.

from phones import PhoneCollection
pc = PhoneCollection()
pc.langs("ast")
# ValueError: Need to select a dialect for "ast". Dialects can be listed using the list_dialects flag
pc.langs("ast", list_dialects=True)
# ["Asturian (Western)", "Asturian (North-Eastern)"]
pc.langs("ast", "Asturian (Western)")

It would be nice to allow something like pc.langs("ast", "western"), which could be achieved by checking whether the dialect string occurs in exactly one of the dialect options.
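
The substring check could look roughly like the sketch below. The function name resolve_dialect and the case-insensitive comparison are assumptions made for illustration, not part of the phones API; the dialect names are the ones from the proposal above.

```python
def resolve_dialect(query, options):
    """Return the one dialect option that contains `query`, ignoring case.

    Hypothetical helper: raises ValueError when the query matches no
    option or more than one option.
    """
    needle = query.casefold()

    # An exact (case-insensitive) match wins, even if the same string is
    # also a substring of a longer option.
    exact = [opt for opt in options if opt.casefold() == needle]
    if len(exact) == 1:
        return exact[0]

    matches = [opt for opt in options if needle in opt.casefold()]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"No dialect matches {query!r}. Options: {options}")
    raise ValueError(f"Dialect {query!r} is ambiguous. Matches: {matches}")


if __name__ == "__main__":
    dialects = ["Asturian (Western)", "Asturian (North-Eastern)"]
    print(resolve_dialect("western", dialects))
```

Raising on an ambiguous query, rather than silently picking the first match, keeps the behaviour in line with the ValueError proposed for a missing dialect.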

basic tests

To avoid build failures that go undetected in the future, some basic tests should be implemented that import all modules and run some basic functionality.
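
A minimal sketch of the "import all modules" part, using only the standard library. The helper name import_all_modules is hypothetical; it is demonstrated on the stdlib json package here so that the snippet runs without phones installed.

```python
import importlib
import pkgutil


def import_all_modules(package_name):
    """Import a package and each of its submodules; return the imported names.

    Any exception raised while importing a submodule propagates to the
    caller, so a test that calls this fails as soon as one module is broken.
    """
    package = importlib.import_module(package_name)
    imported = [package.__name__]
    # Only packages have __path__; a plain module has no submodules to walk.
    search_path = getattr(package, "__path__", [])
    for info in pkgutil.walk_packages(search_path, prefix=package.__name__ + "."):
        importlib.import_module(info.name)
        imported.append(info.name)
    return imported


if __name__ == "__main__":
    # Stand-in package for the demonstration.
    print(import_all_modules("json"))
```

In a test suite, a single test calling import_all_modules("phones") would be expected to surface a missing or broken submodule at build time rather than on a user's machine.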

`PhoneCollection.values` method fails

Greetings.

I see that the method PhoneCollection.values fails. Using the commands featured in the basic usage section I encounter an error:

>>> from phones import PhoneCollection
>>> pc = PhoneCollection()
>>> ph = pc.langs("eng").values[0]

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1490, in array_func
    result = self.grouper._cython_operation(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 959, in _cython_operation
    return cy_op.cython_operation(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 657, in cython_operation
    return self._cython_op_ndim_compat(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 497, in _cython_op_ndim_compat
    return self._call_cython_op(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 541, in _call_cython_op
    func = self._get_cython_function(self.kind, self.how, values.dtype, is_numeric)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 173, in _get_cython_function
    raise NotImplementedError(
NotImplementedError: function is not implemented for this dtype: [how->mean,dtype->object]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pandas/core/nanops.py", line 1692, in _ensure_numeric
    x = float(x)
ValueError: could not convert string to float: 'a aː aː'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pandas/core/nanops.py", line 1696, in _ensure_numeric
    x = complex(x)
ValueError: complex() arg is a malformed string

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/phones/__init__.py", line 219, in values
    self.data.groupby(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1855, in mean
    result = self._cython_agg_general(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1507, in _cython_agg_general
    new_mgr = data.grouped_reduce(array_func)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1503, in grouped_reduce
    applied = sb.apply(func)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 329, in apply
    result = func(self.values, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1503, in array_func
    result = self._agg_py_fallback(values, ndim=data.ndim, alt=alt)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1457, in _agg_py_fallback
    res_values = self.grouper.agg_series(ser, alt, preserve_dtype=True)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 994, in agg_series
    result = self._aggregate_series_pure_python(obj, func)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 1015, in _aggregate_series_pure_python
    res = func(group)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1857, in <lambda>
    alt=lambda x: Series(x).mean(numeric_only=numeric_only),
  File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 11556, in mean
    return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 11201, in mean
    return self._stat_function(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 11158, in _stat_function
    return self._reduce(
  File "/usr/local/lib/python3.8/site-packages/pandas/core/series.py", line 4670, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/nanops.py", line 96, in _f
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/nanops.py", line 158, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/nanops.py", line 421, in new_func
    result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/nanops.py", line 727, in nanmean
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
  File "/usr/local/lib/python3.8/site-packages/pandas/core/nanops.py", line 1699, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err
TypeError: Could not convert a aː aː to numeric

However, I see that the method that takes allophones into account works as expected:

>>> from phones import PhoneCollection
>>> pc = PhoneCollection()
>>> pc.langs("eng").values_with_allophones[:5]
[aː (eng), b (eng), b (eng), d (eng), d (eng)]

I guess it must be due to the c != self.source.allophone_column part in this line:

phones/src/phones/__init__.py

Line 220 in a497d50

[c for c in self.columns if c != self.source.allophone_column]

I encountered this error in versions 0.0.5 and 0.0.6

Thanks in advance

Some languages not found

First of all, thanks for the repository; it may prove very helpful.

I am having some issues, though. I installed the package via pip and tried to look up Iberian languages. Surprisingly, neither Catalan nor Asturian was loaded:

>>> from phones import PhoneCollection
>>> pc = PhoneCollection()
>>> pc.langs('cat').values
[]
>>> pc.langs('ast').values
[]

I checked the .csv file the program uses from phoible (https://raw.githubusercontent.com/phoible/dev/master/data/phoible.csv) and both Asturian and Catalan are present, with those ISO codes (cat and ast). I don't know what the problem could be in this case. I imagine it could be related to the names of the dialects in some way.

Thanks in advance
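
One way to test the dialect hypothesis is to list which dialect labels occur for each ISO code in the CSV. The sketch below assumes the file has columns named ISO6393 and SpecificDialect; the three sample rows are made up for illustration (the dialect names are borrowed from the dialect filter issue) and are not copied from phoible.csv.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; the real file has many more columns and rows.
SAMPLE_CSV = """ISO6393,LanguageName,SpecificDialect
ast,Asturian,Asturian (Western)
ast,Asturian,Asturian (North-Eastern)
eng,English,
"""


def dialects_by_language(csv_text):
    """Map each ISO code to the set of dialect labels found for it.

    An empty dialect cell is recorded as None.
    """
    result = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        result[row["ISO6393"]].add(row["SpecificDialect"] or None)
    return dict(result)


if __name__ == "__main__":
    for iso, dialects in dialects_by_language(SAMPLE_CSV).items():
        print(iso, dialects)
```

If a language only ever appears with a dialect label, a loader that drops dialect rows would end up with nothing for that language, which would be consistent with the empty lists above.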

TypeError: normalize() argument 2 must be str, not float

I'm creating a PhoneCollection with the drop_dialects and merge_same_language flags both set to False in order to load as many languages as possible.

>>> from phones import PhoneCollection
>>> pc=PhoneCollection(drop_dialects=False,merge_same_language=False)

but I get an exception ...

>>> pc=PhoneCollection(drop_dialects=False,merge_same_language=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python310\lib\site-packages\phones\__init__.py", line 77, in __init__
    ].apply(lambda x: unicodedata.normalize("NFC", x))
  File "C:\Python310\lib\site-packages\pandas\core\series.py", line 4433, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "C:\Python310\lib\site-packages\pandas\core\apply.py", line 1088, in apply
    return self.apply_standard()
  File "C:\Python310\lib\site-packages\pandas\core\apply.py", line 1143, in apply_standard
    mapped = lib.map_infer(
  File "pandas\_libs\lib.pyx", line 2870, in pandas._libs.lib.map_infer
  File "C:\Python310\lib\site-packages\phones\__init__.py", line 77, in <lambda>
    ].apply(lambda x: unicodedata.normalize("NFC", x))
TypeError: normalize() argument 2 must be str, not float

I don't know if you already have this in hand in #5, but for now I can proceed with a function wrapping unicodedata.normalize in an exception handler

i.e.

        def normalize(x):
            # Non-string values (the traceback reports a float) make
            # unicodedata.normalize raise a TypeError; pass them through.
            try:
                return unicodedata.normalize("NFC", x)
            except TypeError:
                return x

        if self.source.allophone_column is not None:
            self.data[self.source.allophone_column] = self.data[
                self.source.allophone_column
            ].apply(normalize)

Thanks for a really useful library!!
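
An alternative to catching the exception is to check the type first. This is only a sketch: the name normalize_nfc is hypothetical, and it assumes the float values are NaN placeholders for empty cells, which the traceback suggests but does not confirm.

```python
import unicodedata


def normalize_nfc(value):
    """NFC-normalize strings; return any other value (e.g. NaN) unchanged."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    return value


if __name__ == "__main__":
    decomposed = "a\u0303"  # "a" followed by a combining tilde
    print(len(decomposed), len(normalize_nfc(decomposed)))
    print(normalize_nfc(float("nan")))
```

The explicit isinstance check avoids hiding unrelated errors, and the function can be passed directly to Series.apply.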

remove allophone column before grouping to avoid pandas warning

Pointed out in #9

python3.8/site-packages/phones/__init__.py:219: FutureWarning: The default value of numeric_only 
in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. 
Either specify numeric_only or select only columns which should be valid for the function.
  self.data.groupby(
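
The idea can be sketched on a toy table: select the numeric columns explicitly before calling mean, so the string-valued allophone column never reaches the aggregation. The column names and values below are illustrative and do not reflect the real layout of the phones data.

```python
import pandas as pd

# Toy stand-in for the phone table; names and values are made up.
df = pd.DataFrame(
    {
        "phone": ["a", "a", "b"],
        "allophones": ["a aː", "aː", "b"],  # strings, cannot be averaged
        "voiced": [1.0, 1.0, 1.0],
        "height": [0.0, 0.5, 0.25],
    }
)

# Keep only the numeric feature columns for the aggregation.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
means = df.groupby("phone")[numeric_cols].mean()

if __name__ == "__main__":
    print(means)
```

Passing numeric_only=True to mean is another option the warning itself mentions; selecting columns by hand makes the intent more visible.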

add docs for values_with_allophones

Attribute `dialect_list` returns `TypeError`

I know it's not documented in the API reference, but the PhoneCollection.dialect_list property raises a TypeError:

>>> from phones import PhoneCollection
>>> pc = PhoneCollection(load_dialects=True)
>>> pc.langs("eus").dialect_list
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/python3.8/site-packages/phones/__init__.py", line 135, in dialect_list
    return list(sorted(self.data[self.source.dialect_column].unique()))
TypeError: '<' not supported between instances of 'float' and 'str'

It marks an error here

I am using version 0.0.4

It might still be in a testing phase since it is not documented, but I am reporting it in case you haven't noticed this bug yet.

Thanks in advance!
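
The error message says a float and a str were compared during sorting, which is what happens when missing values (NaN) sit next to dialect names. A minimal sketch of a sort that tolerates this, with the hypothetical helper name sorted_dialects:

```python
def sorted_dialects(values):
    """Return the unique string entries of `values` in sorted order.

    Non-string entries such as float NaN are skipped, because Python
    cannot order floats against strings.
    """
    return sorted({v for v in values if isinstance(v, str)})


if __name__ == "__main__":
    column = ["Asturian (Western)", float("nan"), "Asturian (North-Eastern)"]
    print(sorted_dialects(column))
```

On a pandas Series the same effect would be expected from calling dropna() before unique().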

add SAMPA conversion

I should add SAMPA as another phonetic alphabet and use the reference here: https://www.phon.ucl.ac.uk/home/sampa/

ɶ not found

"ɶ" which appears when converting "&" from xsampa to ipa, does not seem to be found in the phoible database.

A problem with usage

I installed the library using the pip install phones command and ran the following script:

from phones.convert import Converter

converter = Converter()

but I get this error:

Traceback (most recent call last):
  File "ipa_arpa.py", line 3, in <module>
    from phones.convert import Converter
  File "/Users/yehorsmoliakov/opt/miniconda3/lib/python3.8/site-packages/phones/convert.py", line 22, in <module>
    from .phonecodes.src import phonecodes
ModuleNotFoundError: No module named 'phones.phonecodes'

dialect-specific allophones

Hello again.

I found that no allophones are being loaded, although the allophone column is set by default in the __init__ of PHOIBLE (allophone_column='Allophones').

I checked this across all phonemes:

>>> from phones import PhoneCollection
>>> pc = PhoneCollection()
>>> {tuple(sorted(p.allophones)) for p in pc.langs(pc.lang_list).values}
{()}
>>>

I might be missing something, though...

I am using version 0.0.4

I am also having this warning, btw:

python3.8/site-packages/phones/__init__.py:219: FutureWarning: The default value of numeric_only 
in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. 
Either specify numeric_only or select only columns which should be valid for the function.
  self.data.groupby(

Thanks in advance!

some tokens not recognized as equal in plot_collection

Currently https://cdminix.me/phones/examples/plots/ shows the phones in English and German, but it seems like some phones are not recognized as shared between those two languages. Maybe this is some kind of float equality issue?
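
If the float equality guess is right, comparing feature vectors with a tolerance instead of == would be one fix. This is a sketch of that idea only; the function same_features and the tolerance value are assumptions, and the actual cause of the plot behaviour has not been confirmed.

```python
import math


def same_features(a, b, abs_tol=1e-9):
    """Return True if two feature vectors match element-wise within a tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b)
    )


if __name__ == "__main__":
    v1 = [0.1 + 0.2, 1.0]  # first entry carries a rounding error
    v2 = [0.3, 1.0]
    print(v1 == v2, same_features(v1, v2))
```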

fix documentation generation

Probably because of the upgrade to Python 3.8+, the documentation build seems to fail.