rspeer / wordfreq
Access a database of word frequencies, in various natural languages.
License: Other
Hello, my name is Mikel.
I would like to know if there is any possibility of adding a new language to the library: the Basque language.
And if the answer is yes, how could I collaborate to make it happen?
Thank you.
I'm getting the following error when I try to install:
src/marisa_trie.cpp:17944:34: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17944 | __pyx_type_11marisa_trie__Trie.tp_print = 0;
src/marisa_trie.cpp:17968:39: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17968 | __pyx_type_11marisa_trie_BinaryTrie.tp_print = 0;
src/marisa_trie.cpp:17981:46: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17981 | __pyx_type_11marisa_trie__UnicodeKeyedTrie.tp_print = 0;
src/marisa_trie.cpp:17995:33: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17995 | __pyx_type_11marisa_trie_Trie.tp_print = 0;
src/marisa_trie.cpp:18014:38: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18014 | __pyx_type_11marisa_trie_BytesTrie.tp_print = 0;
src/marisa_trie.cpp:18039:40: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18039 | __pyx_type_11marisa_trie__UnpackTrie.tp_print = 0;
src/marisa_trie.cpp:18052:39: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18052 | __pyx_type_11marisa_trie_RecordTrie.tp_print = 0;
[the same ‘tp_print’ error repeats for each generated __pyx_scope_struct type, through src/marisa_trie.cpp:18154]
[followed by many repetitions of these deprecation warnings, at expansion sites of the PyUnicode_GET_SIZE macro on src/marisa_trie.cpp lines 18940 and 18956:]
/usr/include/python3.9/cpython/unicodeobject.h:451:75: warning: ‘Py_ssize_t _PyUnicode_get_wstr_length(PyObject*)’ is deprecated [-Wdeprecated-declarations]
/usr/include/python3.9/cpython/unicodeobject.h:262:52: warning: ‘Py_UNICODE* PyUnicode_AsUnicode(PyObject*)’ is deprecated [-Wdeprecated-declarations]
src/marisa_trie.cpp:20109:45: warning: ‘PyObject* PyUnicode_FromUnicode(const Py_UNICODE*, Py_ssize_t)’ is deprecated [-Wdeprecated-declarations]
20109 | return PyUnicode_FromUnicode(NULL, 0);
src/marisa_trie.cpp:20142:45: warning: ‘PyObject* PyUnicode_FromUnicode(const Py_UNICODE*, Py_ssize_t)’ is deprecated [-Wdeprecated-declarations]
20142 | return PyUnicode_FromUnicode(NULL, 0);
error: Setup script exited with error: command '/usr/bin/gcc' failed with exit code 1
We already have enough data in Czech and Slovak to build wordfreq lists for them. They should be added to the next version.
Other languages that could make it: Persian, Slovenian, Estonian.
But pip install wordfreq[cjk]==2.5.1 does get the extras:
$ pip install wordfreq[cjk]
Collecting wordfreq[cjk]
Downloading wordfreq-3.0.1-py3-none-any.whl (56.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.8/56.8 MB 9.7 MB/s eta 0:00:00
Collecting ftfy>=6.1
Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 13.6 MB/s eta 0:00:00
Collecting langcodes>=3.0
Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 181.6/181.6 kB 19.4 MB/s eta 0:00:00
Collecting msgpack>=1.0
Downloading msgpack-1.0.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (322 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 322.5/322.5 kB 18.8 MB/s eta 0:00:00
Collecting regex>=2020.04.04
Downloading regex-2022.9.13-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (772 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.3/772.3 kB 14.7 MB/s eta 0:00:00
Collecting wcwidth>=0.2.5
Downloading wcwidth-0.2.5-py2.py3-none-any.whl (30 kB)
Installing collected packages: wcwidth, msgpack, regex, langcodes, ftfy, wordfreq
Successfully installed ftfy-6.1.1 langcodes-3.3.0 msgpack-1.0.4 regex-2022.9.13 wcwidth-0.2.5 wordfreq-3.0.1
$ pip install wordfreq[cjk]==2.5.1
Collecting wordfreq[cjk]==2.5.1
Downloading wordfreq-2.5.1.tar.gz (56.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.8/56.8 MB 11.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: msgpack>=1.0 in ./venv/lib/python3.8/site-packages (from wordfreq[cjk]==2.5.1) (1.0.4)
Requirement already satisfied: langcodes>=3.0 in ./venv/lib/python3.8/site-packages (from wordfreq[cjk]==2.5.1) (3.3.0)
Requirement already satisfied: regex>=2020.04.04 in ./venv/lib/python3.8/site-packages (from wordfreq[cjk]==2.5.1) (2022.9.13)
Requirement already satisfied: ftfy>=3.0 in ./venv/lib/python3.8/site-packages (from wordfreq[cjk]==2.5.1) (6.1.1)
Collecting mecab-python3
Downloading mecab_python3-1.0.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (577 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 577.3/577.3 kB 15.4 MB/s eta 0:00:00
Collecting ipadic
Downloading ipadic-1.0.0.tar.gz (13.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.4/13.4 MB 15.2 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting mecab-ko-dic
Downloading mecab-ko-dic-1.0.0.tar.gz (33.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.2/33.2 MB 13.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting jieba>=0.42
Downloading jieba-0.42.1.tar.gz (19.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.2/19.2 MB 16.1 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: wcwidth>=0.2.5 in ./venv/lib/python3.8/site-packages (from ftfy>=3.0->wordfreq[cjk]==2.5.1) (0.2.5)
Building wheels for collected packages: jieba, ipadic, mecab-ko-dic, wordfreq
Building wheel for jieba (setup.py) ... done
Created wheel for jieba: filename=jieba-0.42.1-py3-none-any.whl size=19314459 sha256=8ca4c94b1c6311a8c15ad2f4a2c7e18346155853d1cc9296a17c7bbd30322774
Stored in directory: /home/user/.cache/pip/wheels/ca/38/d8/dfdfe73bec1d12026b30cb7ce8da650f3f0ea2cf155ea018ae
Building wheel for ipadic (setup.py) ... done
Created wheel for ipadic: filename=ipadic-1.0.0-py3-none-any.whl size=13556704 sha256=04891c7a7cb8436787944c3daac6cd6f173a7526363890909e8979473bdfa87b
Stored in directory: /home/user/.cache/pip/wheels/45/b7/f5/a21e68db846eedcd00d69e37d60bab3f68eb20b1d99cdff652
Building wheel for mecab-ko-dic (setup.py) ... done
Created wheel for mecab-ko-dic: filename=mecab_ko_dic-1.0.0-py3-none-any.whl size=33424393 sha256=cc897b647cb5c7e5739ddfb20a02c6ce2646bb343c9b32243547ffe42d318cf6
Stored in directory: /home/user/.cache/pip/wheels/c2/c6/6d/d7789f7fb7f60e98ce7febfa26300cd7cf2b88a02a9bb97096
Building wheel for wordfreq (setup.py) ... done
Created wheel for wordfreq: filename=wordfreq-2.5.1-py3-none-any.whl size=56830991 sha256=8382d9739a82ad065fcfe5506237a2ca1838e6916332514527cad06d45d42f31
Stored in directory: /home/user/.cache/pip/wheels/00/85/d7/6f6004757be385f8008965b3d112c1ac88c9837457faecfb31
Successfully built jieba ipadic mecab-ko-dic wordfreq
Installing collected packages: mecab-python3, mecab-ko-dic, jieba, ipadic, wordfreq
Attempting uninstall: wordfreq
Found existing installation: wordfreq 3.0.1
Uninstalling wordfreq-3.0.1:
Successfully uninstalled wordfreq-3.0.1
Successfully installed ipadic-1.0.0 jieba-0.42.1 mecab-ko-dic-1.0.0 mecab-python3-1.0.5 wordfreq-2.5.1
Using pip version 22.2.2. Repackaging with a newer Poetry might work?
Unlike the 'no-break space' ("\u00A0"), the 'narrow no-break space' ("\u202f") is not recognized as a word boundary.
tokenize("La vois-tu souvent ?", "fr")
returns ['la', 'vois', 'tu', 'souvent\u202f'] instead of ['la', 'vois', 'tu', 'souvent']
This is a problem because in French, some punctuation marks like ; : ! ? need a non-breaking space (ideally a narrow one) between them and the word placed before them.
I suppose one solution would be to modify "TOKEN_RE" in the "tokens" module to take this case into account. Unless, of course, this would create undesirable effects in other languages. Another solution could be to replace "\u202f" by "\u00A0" when preprocessing French texts.
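A minimal sketch of that second, preprocessing-based workaround (preprocess_fr is a made-up helper name, and this assumes the ordinary no-break space is already handled correctly by the tokenizer):

```python
def preprocess_fr(text: str) -> str:
    # Replace the narrow no-break space (U+202F), which the tokenizer does
    # not currently treat as a word boundary, with the ordinary no-break
    # space (U+00A0), which it does, before calling wordfreq.tokenize.
    return text.replace("\u202f", "\u00a0")

print(preprocess_fr("La vois-tu souvent\u202f?"))
```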
Thank you anyway for sharing this library which is for me essential when it comes to identifying the rarest words in a text.
Getting this error when trying to execute as a .py file.
Is it even possible to execute this as a standalone script to, for example, get a quick wordlist?
ImportError: cannot import name 'top_n_list' from partially initialized module 'wordfreq' (most likely due to a circular import) (/home/yaoberh/wordfreq.py)
My proof of this is in the form of a failed Jenkins job, if you want to look at it. But it's pretty clear what's going on -- it needs to be able to find both dictionaries to make mecab.MECAB_ANALYZERS, even if you're only ever going to want the one.
Installing the dependency marisa-trie runs into this:
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.22.27905\include\crtdefs.h(10): fatal error C1083: Cannot open include file: 'corecrt.h': No such file or directory error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.22.27905\\bin\\HostX86\\x86\\cl.exe' failed with exit status 2 ---------------------------------------- ERROR: Command errored out with exit status 1: 'c:\users\********\appdata\local\programs\python\python38-32\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\********\\AppData\\Local\\Temp\\pip-install-w5unt35s\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\********\\AppData\\Local\\Temp\\pip-install-w5unt35s\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\********\AppData\Local\Temp\pip-record-2nqt16fl\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.
I found this issue in the marisa-trie repo hinting that the latest versions need to be built on some platforms, and other issues pointing at problems with Windows 10; marisa-trie seems to be abandoned. I just need the word frequencies for some words in some languages, that's all; is there perhaps a way to bypass this? Thanks.
P.S.: While the Python package is convenient (if one is a Python user), having the frequencies available in a more universal, platform- and language-independent format too (e.g. plain CSV) would be super neat!
This has apparently been the case for a while, but we should fix it in an update:
The tokenize function assumes it's getting a nicely-normalized language code. But when looking up word frequencies, we don't actually normalize the language code until later, and we do it inside get_frequency_list without returning it.
I can think of an ugly fix we could make right away, or a nice fix that would require a change to langcodes to make simple cases of language matching faster.
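As a rough illustration of the "ugly fix" (this is not wordfreq's actual code; the alias table and helper name are invented for the example), the public entry points could normalize the language code once, up front, instead of deep inside get_frequency_list:

```python
# Hypothetical stand-in for the normalization that currently happens inside
# get_frequency_list without being returned to the caller.
LANGUAGE_ALIASES = {"iw": "he", "in": "id"}  # illustrative alias table only

def normalize_lang(lang: str) -> str:
    # Lowercase, unify separators, keep the primary subtag, apply aliases.
    base = lang.replace("_", "-").lower().split("-")[0]
    return LANGUAGE_ALIASES.get(base, base)

print(normalize_lang("EN_us"))  # "en"
```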
Here is a big corpus:
http://mokk.bme.hu/en/resources/webcorpus/
Would it be possible to process more languages?
This commit introduced Mypy as a wordfreq dependency, but there doesn't appear to be any runtime functionality provided by this dependency. I use a newer version of Mypy in one of my projects, which causes the following error when I try to install wordfreq:
Because wordfreq (3.0.0) depends on mypy (>=0.931,<0.932)
and no versions of wordfreq match >3.0.0,<4.0.0, wordfreq (>=3.0.0,<4.0.0) requires mypy (>=0.931,<0.932).
So, because my-project depends on both mypy (0.941) and wordfreq (^3.0.0), version solving failed.
Since there's no need to include Mypy as a main dependency, could it be moved to the dev dependencies for this project?
In the meantime, I'm using an earlier release of this project (2.5.1), which doesn't have the dependency on Mypy.
Thanks for maintaining wordfreq! 😄
Hi, I'm trying to use wordfreq for Japanese on CentOS 7. I keep getting the error Couldn't find the MeCab dictionary named 'mecab-ipadic-utf8'; however, there's no such package on CentOS 7. It's called mecab-ipadic. How can I run wordfreq on CentOS 7 in this case? Thank you so much.
I tokenized an English text that contained short forms like it’ll or you’ve and then got the word frequency for each token. However, for these short forms, the zipf_freq() function gave me a frequency of zero.
Is this a problem with the character, the tokenizer or the data?
Windows 10, Python 3.9, wordfreq 2.5.0
We tried to standardize the tokenization of French words such as "l'heure" across different versions of mrab's regex module. The fix assumed that we want this to come out as ["l'", 'heure'], and to recognize this pattern as one or two letters, an apostrophe, and a vowel.
A similar pattern appears in Italian, but it can have four characters before the apostrophe.
On regex 2020.4.4, we get this tokenization:
>>> wordfreq.tokenize("nell'obolo", 'it')
["nell'obolo"]
But on regex 2018.2.21, we get:
>>> wordfreq.tokenize("nell'obolo", 'it')
["nell", "obolo"]
This should be standardized as well. If we insist on a particular version of regex, we will probably cause conflicts with other libraries such as spacy.
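For illustration only (this is a simplified pattern, not wordfreq's actual TOKEN_RE), extending the elision prefix to four letters would cover the Italian case as well as the French one:

```python
import re

# Simplified elision pattern: up to four letters, an apostrophe, then the
# following word. The real TOKEN_RE is far more involved; this is a sketch.
ELISION = re.compile(r"([a-zA-Z]{1,4}')(\w+)")

print(ELISION.match("nell'obolo").groups())  # ("nell'", "obolo")
print(ELISION.match("l'heure").groups())     # ("l'", "heure")
```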
Hi!
Thanks for this great application! I just wanted to tell you that the ISO 639-1 code for Slovenian is sl
and the code for Slovak is sk
.
I think you got those mixed up in the table in your README.md.
Yes, the two countries have very similar flags, names, and languages :)
Best regards
Hi! I noticed you've added versions 2.4 and 2.4.1 to the changelog:
https://github.com/LuminosoInsight/wordfreq/blob/master/CHANGELOG.md
But not to PyPI
https://pypi.org/project/wordfreq/#history
Would it be possible to upload them to PyPI?
Hey there,
after pip installing the newest version, I constantly get this error:
>>> from wordfreq import word_frequency
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xujinghua/miniconda3/lib/python3.7/site-packages/wordfreq/__init__.py", line 16, in <module>
from .numbers import digit_freq, has_digit_sequence, smash_numbers
File "/Users/xujinghua/miniconda3/lib/python3.7/site-packages/wordfreq/numbers.py", line 99, in <module>
def _sub_zeroes(match: regex.Match) -> str:
AttributeError: module 'regex' has no attribute 'Match'
whereas wordfreq 1.4 works fine. Does anyone know how to fix this? Thanks a lot for your help in advance!
Cheers,
Xu
Is there a way to use custom word lists? Say if I wanted to know the frequency of the word "whale" in the text of Moby Dick. I would have thought such a task would be within the scope of this library, yet I can't find anything in the documentation about it.
I realise that I could use the tokenize method combined with something like collections.Counter, but that would seem to somewhat defeat the purpose.
I've tried the following but to no avail:
with Path("moby_dick.txt").open() as f:
    moby_dick = f.read()
tokenized = tokenize(moby_dick, "en")
whale_freq = word_frequency("whale", "en", wordlist=tokenized)
print("whale_freq:", whale_freq)

with Path("moby_dick.txt").open() as f:
    moby_dick = f.read()
whale_freq = word_frequency("whale", "en", wordlist=moby_dick)
print("whale_freq:", whale_freq)
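For what it's worth, the Counter-based fallback mentioned above would look something like this (the tokens list here is stand-in data for what tokenize(moby_dick, "en") would return):

```python
from collections import Counter

# Stand-in token list; in practice this would be wordfreq.tokenize output.
tokens = ["call", "me", "ishmael", "the", "whale", "the", "whale"]

counts = Counter(tokens)
whale_freq = counts["whale"] / sum(counts.values())
print("whale_freq:", whale_freq)
```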
wordfreq raises warnings because it's using an obsolete parameter to msgpack, encoding='utf-8'. It should be updated to raw=False.
Hi,
Would it be possible to publish an updated version of wordfreq on Zenodo, to enable referencing the latest version?
Currently latest version on Zenodo is v2.2.
Thank you for this great package 🥇
I'm interested in using wordfreq in a project I'm doing with spaCy; unfortunately, both wordfreq and spaCy require different versions of regex at an exact version number. Would it be possible to loosen the requirements on the regex version needed?
Is it possible to get a character list for each language ordered by frequency?
Going through the top_n_list in several languages shows numbers such as 00 or 0000 as extremely frequent.
Is this expected behavior? These numbers are clearly not as common as the library says they are. Is there a way to remove these numbers from the function's return value so they don't skew the result?
>>> top_n_list('en', 30)
['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for', 'you', 'it', 'on', '00', 'with', 'was', 'be', 'this', 'as', 'are', 'not', 'have', 'at', 'he', 'by', 'from', 'but', '0000', 'my', 'an']
>>> top_n_list('es', 30)
['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se', 'un', 'por', 'del', 'es', 'las', 'con', 'una', 'para', 'lo', 'su', 'al', '00', 'como', 'me', 'más', 'si', 'pero', 'o', '0000', 'te']
>>> top_n_list('it', 30)
['di', 'e', 'che', 'il', 'la', 'in', 'a', 'non', 'un', 'per', 'è', 'del', 'l', 'i', 'una', 'le', 'si', 'della', 'con', 'da', '00', 'sono', 'ma', 'al', 'come', 'ha', 'più', 'dei', 'se', 'nel']
Would be wonderful to be able to get frequencies for the different sources!
I tried with pip3 and pip as well. Python is set up correctly.
$ pip3 install wordfreq[cjk]
zsh: no matches found: wordfreq[cjk]
mecab-python3 itself doesn't recommend ipadic anymore.
In order to use MeCab, you must install a dictionary. There are many different dictionaries available for MeCab. These UniDic packages, which include slight modifications for ease of use, are recommended:
- unidic: The latest full UniDic.
- unidic-lite: A slightly modified UniDic 2.1.2, chosen for its small size.
The dictionaries below are not recommended due to being unmaintained for many years, but they are available for use with legacy applications.
For more details on the differences between dictionaries see here.
Furthermore, other tokenizers might also be considered (though that's a little out of scope, and could create more confusion, perhaps).
/wordfreq/numbers.py", line 99, in
def _sub_zeroes(match: regex.Match) -> str:
AttributeError: module 'regex' has no attribute 'Match'
Hi there, I got this error, but I solved it by going into the file "numbers.py" and changing
import regex
into:
import re as regex
Did I do the right thing?
There is a project under the Unicode License (a permissive license) called Unilex that is gathering frequency data for 1,000 languages. It is based on Google's corpuscrawler, which is written in Python and hand-fed links to WordPress sites, Bible translations, etc. Both projects are on GitHub.
I am very grateful for the current minimum argument: it has helped me on many occasions. However, I am wondering if it would be possible for minimum to also accept a function/lambda, since this would simplify a lot of things? For instance, something like:
zipf_frequency("dog", "en", minimum=lambda word: math.log10(len(word)))
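In the meantime, a thin wrapper can emulate a callable minimum on top of the existing numeric parameter (zipf_frequency is stubbed out below so the sketch runs standalone; the wrapper name is invented):

```python
import math

def zipf_frequency_stub(word, lang):
    # Placeholder for wordfreq.zipf_frequency, so the example is self-contained.
    return 4.2

def zipf_with_callable_minimum(word, lang, minimum=0.0):
    # Accept either a number or a function of the word for the floor value.
    floor = minimum(word) if callable(minimum) else minimum
    return max(zipf_frequency_stub(word, lang), floor)

print(zipf_with_callable_minimum("dog", "en",
                                 minimum=lambda w: math.log10(len(w))))
```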