rspeer / wordfreq
Access a database of word frequencies, in various natural languages.
License: Other
Hello, my name is Mikel.
I would like to know if there is any possibility of adding a new language to the library: the Basque language.
And if the answer is yes, how could I collaborate to make it happen?
Thank you.
I'm getting the following error when I try to install:
src/marisa_trie.cpp:17944:34: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17944 | __pyx_type_11marisa_trie__Trie.tp_print = 0;
src/marisa_trie.cpp:17968:39: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17968 | __pyx_type_11marisa_trie_BinaryTrie.tp_print = 0;
src/marisa_trie.cpp:17981:46: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17981 | __pyx_type_11marisa_trie__UnicodeKeyedTrie.tp_print = 0;
src/marisa_trie.cpp:17995:33: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17995 | __pyx_type_11marisa_trie_Trie.tp_print = 0;
src/marisa_trie.cpp:18014:38: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18014 | __pyx_type_11marisa_trie_BytesTrie.tp_print = 0;
src/marisa_trie.cpp:18039:40: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18039 | __pyx_type_11marisa_trie__UnpackTrie.tp_print = 0;
src/marisa_trie.cpp:18052:39: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18052 | __pyx_type_11marisa_trie_RecordTrie.tp_print = 0;
[the same ‘tp_print’ error repeats for each generated __pyx_scope_struct type, through src/marisa_trie.cpp:18154]
[followed by many repetitions of these deprecation warnings, at expansion sites of the PyUnicode_GET_SIZE macro on src/marisa_trie.cpp lines 18940 and 18956:]
/usr/include/python3.9/cpython/unicodeobject.h:451:75: warning: ‘Py_ssize_t _PyUnicode_get_wstr_length(PyObject*)’ is deprecated [-Wdeprecated-declarations]
/usr/include/python3.9/cpython/unicodeobject.h:262:52: warning: ‘Py_UNICODE* PyUnicode_AsUnicode(PyObject*)’ is deprecated [-Wdeprecated-declarations]
src/marisa_trie.cpp:20109:45: warning: ‘PyObject* PyUnicode_FromUnicode(const Py_UNICODE*, Py_ssize_t)’ is deprecated [-Wdeprecated-declarations]
20109 | return PyUnicode_FromUnicode(NULL, 0);
src/marisa_trie.cpp:20142:45: warning: ‘PyObject* PyUnicode_FromUnicode(const Py_UNICODE*, Py_ssize_t)’ is deprecated [-Wdeprecated-declarations]
20142 | return PyUnicode_FromUnicode(NULL, 0);
error: Setup script exited with error: command '/usr/bin/gcc' failed with exit code 1
We already have enough data in Czech and Slovak to build wordfreq lists for them. They should be added to the next version.
Other languages that could make it: Persian, Slovenian, Estonian.
But pip install wordfreq[cjk]==2.5.1 does get the extras:
$ pip install wordfreq[cjk]
Collecting wordfreq[cjk]
Downloading wordfreq-3.0.1-py3-none-any.whl (56.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.8/56.8 MB 9.7 MB/s eta 0:00:00
Collecting ftfy>=6.1
Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 13.6 MB/s eta 0:00:00
Collecting langcodes>=3.0
Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 181.6/181.6 kB 19.4 MB/s eta 0:00:00
Collecting msgpack>=1.0
Downloading msgpack-1.0.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (322 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 322.5/322.5 kB 18.8 MB/s eta 0:00:00
Collecting regex>=2020.04.04
Downloading regex-2022.9.13-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (772 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.3/772.3 kB 14.7 MB/s eta 0:00:00
Collecting wcwidth>=0.2.5
Downloading wcwidth-0.2.5-py2.py3-none-any.whl (30 kB)
Installing collected packages: wcwidth, msgpack, regex, langcodes, ftfy, wordfreq
Successfully installed ftfy-6.1.1 langcodes-3.3.0 msgpack-1.0.4 regex-2022.9.13 wcwidth-0.2.5 wordfreq-3.0.1
$ pip install wordfreq[cjk]==2.5.1
Collecting wordfreq[cjk]==2.5.1
Downloading wordfreq-2.5.1.tar.gz (56.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.8/56.8 MB 11.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: msgpack>=1.0 in ./venv/lib/python3.8/site-packages (from wordfreq[cjk]==2.5.1) (1.0.4)
Requirement already satisfied: langcodes>=3.0 in ./venv/lib/python3.8/site-packages (from wordfreq[cjk]==2.5.1) (3.3.0)
Requirement already satisfied: regex>=2020.04.04 in ./venv/lib/python3.8/site-packages (from wordfreq[cjk]==2.5.1) (2022.9.13)
Requirement already satisfied: ftfy>=3.0 in ./venv/lib/python3.8/site-packages (from wordfreq[cjk]==2.5.1) (6.1.1)
Collecting mecab-python3
Downloading mecab_python3-1.0.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (577 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 577.3/577.3 kB 15.4 MB/s eta 0:00:00
Collecting ipadic
Downloading ipadic-1.0.0.tar.gz (13.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.4/13.4 MB 15.2 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting mecab-ko-dic
Downloading mecab-ko-dic-1.0.0.tar.gz (33.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33.2/33.2 MB 13.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting jieba>=0.42
Downloading jieba-0.42.1.tar.gz (19.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.2/19.2 MB 16.1 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: wcwidth>=0.2.5 in ./venv/lib/python3.8/site-packages (from ftfy>=3.0->wordfreq[cjk]==2.5.1) (0.2.5)
Building wheels for collected packages: jieba, ipadic, mecab-ko-dic, wordfreq
Building wheel for jieba (setup.py) ... done
Created wheel for jieba: filename=jieba-0.42.1-py3-none-any.whl size=19314459 sha256=8ca4c94b1c6311a8c15ad2f4a2c7e18346155853d1cc9296a17c7bbd30322774
Stored in directory: /home/user/.cache/pip/wheels/ca/38/d8/dfdfe73bec1d12026b30cb7ce8da650f3f0ea2cf155ea018ae
Building wheel for ipadic (setup.py) ... done
Created wheel for ipadic: filename=ipadic-1.0.0-py3-none-any.whl size=13556704 sha256=04891c7a7cb8436787944c3daac6cd6f173a7526363890909e8979473bdfa87b
Stored in directory: /home/user/.cache/pip/wheels/45/b7/f5/a21e68db846eedcd00d69e37d60bab3f68eb20b1d99cdff652
Building wheel for mecab-ko-dic (setup.py) ... done
Created wheel for mecab-ko-dic: filename=mecab_ko_dic-1.0.0-py3-none-any.whl size=33424393 sha256=cc897b647cb5c7e5739ddfb20a02c6ce2646bb343c9b32243547ffe42d318cf6
Stored in directory: /home/user/.cache/pip/wheels/c2/c6/6d/d7789f7fb7f60e98ce7febfa26300cd7cf2b88a02a9bb97096
Building wheel for wordfreq (setup.py) ... done
Created wheel for wordfreq: filename=wordfreq-2.5.1-py3-none-any.whl size=56830991 sha256=8382d9739a82ad065fcfe5506237a2ca1838e6916332514527cad06d45d42f31
Stored in directory: /home/user/.cache/pip/wheels/00/85/d7/6f6004757be385f8008965b3d112c1ac88c9837457faecfb31
Successfully built jieba ipadic mecab-ko-dic wordfreq
Installing collected packages: mecab-python3, mecab-ko-dic, jieba, ipadic, wordfreq
Attempting uninstall: wordfreq
Found existing installation: wordfreq 3.0.1
Uninstalling wordfreq-3.0.1:
Successfully uninstalled wordfreq-3.0.1
Successfully installed ipadic-1.0.0 jieba-0.42.1 mecab-ko-dic-1.0.0 mecab-python3-1.0.5 wordfreq-2.5.1
Using pip version 22.2.2. Repackaging with a newer Poetry might work?
Unlike the 'no-break space' ("\u00A0"), the 'narrow no-break space' ("\u202f") is not recognized as a word boundary.
tokenize("La vois-tu souvent ?", "fr")
returns ['la', 'vois', 'tu', 'souvent\u202f'] instead of ['la', 'vois', 'tu', 'souvent']
This is a problem because in French, some punctuation marks like ; : ! ? need a non-breaking space (ideally a narrow one) between them and the word placed before them.
I suppose one solution would be to modify "TOKEN_RE" in the "tokens" module to take this case into account. Unless, of course, this would create undesirable effects in other languages. Another solution could be to replace "\u202f" by "\u00A0" when preprocessing French texts.
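A minimal sketch of that second, preprocessing-based workaround (preprocess_fr is a made-up helper name, and this assumes the ordinary no-break space is already handled correctly by the tokenizer):

```python
def preprocess_fr(text: str) -> str:
    # Replace the narrow no-break space (U+202F), which the tokenizer does
    # not currently treat as a word boundary, with the ordinary no-break
    # space (U+00A0), which it does, before calling wordfreq.tokenize.
    return text.replace("\u202f", "\u00a0")

print(preprocess_fr("La vois-tu souvent\u202f?"))
```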
Thank you anyway for sharing this library which is for me essential when it comes to identifying the rarest words in a text.
Getting this error when trying to execute as a .py file.
Is it even possible to execute this as a standalone script to, for example, get a quick wordlist?
ImportError: cannot import name 'top_n_list' from partially initialized module 'wordfreq' (most likely due to a circular import) (/home/yaoberh/wordfreq.py)
My proof of this is in the form of a failed Jenkins job, if you want to look at it. But it's pretty clear what's going on -- it needs to be able to find both dictionaries to make mecab.MECAB_ANALYZERS, even if you're only ever going to want the one.
Installing the dependency marisa-trie runs into this:
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.22.27905\include\crtdefs.h(10): fatal error C1083: Cannot open include file: 'corecrt.h': No such file or directory error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.22.27905\\bin\\HostX86\\x86\\cl.exe' failed with exit status 2 ---------------------------------------- ERROR: Command errored out with exit status 1: 'c:\users\********\appdata\local\programs\python\python38-32\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\********\\AppData\\Local\\Temp\\pip-install-w5unt35s\\marisa-trie\\setup.py'"'"'; __file__='"'"'C:\\Users\\********\\AppData\\Local\\Temp\\pip-install-w5unt35s\\marisa-trie\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\********\AppData\Local\Temp\pip-record-2nqt16fl\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.
I found this issue in the marisa-trie repo hinting that the latest versions need to be built on some platforms, and other issues pointing at problems with Windows 10; marisa-trie seems to be abandoned. I just need the word frequencies for some words in some languages, that's all; is there perhaps a way to bypass this? Thanks.
P.S.: While the Python package is convenient (if one is a Python user), having the frequencies available in a more universal, platform- and language-independent format too (e.g. plain CSV) would be super neat!
This has apparently been the case for a while, but we should fix it in an update:
The tokenize function assumes it's getting a nicely-normalized language code. But when looking up word frequencies, we don't actually normalize the language code until later, and we do it inside get_frequency_list without returning it.
I can think of an ugly fix we could make right away, or a nice fix that would require a change to langcodes to make simple cases of language matching faster.
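As a rough illustration of the "ugly fix" (this is not wordfreq's actual code; the alias table and helper name are invented for the example), the public entry points could normalize the language code once, up front, instead of deep inside get_frequency_list:

```python
# Hypothetical stand-in for the normalization that currently happens inside
# get_frequency_list without being returned to the caller.
LANGUAGE_ALIASES = {"iw": "he", "in": "id"}  # illustrative alias table only

def normalize_lang(lang: str) -> str:
    # Lowercase, unify separators, keep the primary subtag, apply aliases.
    base = lang.replace("_", "-").lower().split("-")[0]
    return LANGUAGE_ALIASES.get(base, base)

print(normalize_lang("EN_us"))  # "en"
```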
Here is a big corpus:
http://mokk.bme.hu/en/resources/webcorpus/
Would it be possible to process more languages?
This commit introduced Mypy as a wordfreq dependency, but there doesn't appear to be any runtime functionality provided by this dependency. I use a newer version of Mypy in one of my projects, which causes the following error when I try to install wordfreq:
Because wordfreq (3.0.0) depends on mypy (>=0.931,<0.932)
and no versions of wordfreq match >3.0.0,<4.0.0, wordfreq (>=3.0.0,<4.0.0) requires mypy (>=0.931,<0.932).
So, because my-project depends on both mypy (0.941) and wordfreq (^3.0.0), version solving failed.
Since there's no need to include Mypy as a main dependency, could it be moved to the dev dependencies for this project?
In the meantime, I'm using an earlier release of this project (2.5.1), which doesn't have the dependency on Mypy.
Thanks for maintaining wordfreq! 😄
Hi, I'm trying to use wordfreq for Japanese on CentOS 7. I keep getting the error Couldn't find the MeCab dictionary named 'mecab-ipadic-utf8'; however, there's no such package on CentOS 7. It's called mecab-ipadic. How can I run wordfreq on CentOS 7 in this case? Thank you so much.
I tokenized an English text that contained short forms like it’ll or you’ve and then got the word frequency for each token. However, for these short forms, the zipf_freq() function gave me a frequency of zero.
Is this a problem with the character, the tokenizer or the data?
Windows 10, Python 3.9, wordfreq 2.5.0
We tried to standardize the tokenization of French words such as "l'heure" across different versions of mrab's regex module. The fix assumed that we want this to come out as ["l'", 'heure'], and to recognize this pattern as one or two letters, an apostrophe, and a vowel.
A similar pattern appears in Italian, but it can have four characters before the apostrophe.
On regex 2020.4.4, we get this tokenization:
>>> wordfreq.tokenize("nell'obolo", 'it')
["nell'obolo"]
But on regex 2018.2.21, we get:
>>> wordfreq.tokenize("nell'obolo", 'it')
["nell", "obolo"]
This should be standardized as well. If we insist on a particular version of regex, we will probably cause conflicts with other libraries such as spacy.
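For illustration only (this is a simplified pattern, not wordfreq's actual TOKEN_RE), extending the elision prefix to four letters would cover the Italian case as well as the French one:

```python
import re

# Simplified elision pattern: up to four letters, an apostrophe, then the
# following word. The real TOKEN_RE is far more involved; this is a sketch.
ELISION = re.compile(r"([a-zA-Z]{1,4}')(\w+)")

print(ELISION.match("nell'obolo").groups())  # ("nell'", "obolo")
print(ELISION.match("l'heure").groups())     # ("l'", "heure")
```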
Hi!
Thanks for this great application! I just wanted to tell you that the ISO 639-1 code for Slovenian is sl
and the code for Slovak is sk
.
I think you got those mixed up in the table in your README.md.
Yes, the two countries have very similar flags, names, and languages :)
Best regards
Hi! I noticed you've added versions 2.4 and 2.4.1 to the changelog:
https://github.com/LuminosoInsight/wordfreq/blob/master/CHANGELOG.md
But not to PyPI
https://pypi.org/project/wordfreq/#history
Would it be possible to upload them to PyPI?
Hey there,
after pip installing the newest version, I constantly get this error:
>>> from wordfreq import word_frequency
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xujinghua/miniconda3/lib/python3.7/site-packages/wordfreq/__init__.py", line 16, in <module>
from .numbers import digit_freq, has_digit_sequence, smash_numbers
File "/Users/xujinghua/miniconda3/lib/python3.7/site-packages/wordfreq/numbers.py", line 99, in <module>
def _sub_zeroes(match: regex.Match) -> str:
AttributeError: module 'regex' has no attribute 'Match'
whereas wordfreq 1.4 works fine. Does anyone know how to fix this? Thanks a lot for your help in advance!
Cheers,
Xu
Is there a way to use custom word lists? Say if I wanted to know the frequency of the word "whale" in the text of Moby Dick. I would have thought such a task would be within the scope of this library, yet I can't find anything in the documentation about it.
I realise that I could use the tokenize method combined with something like collections.Counter, but that would seem to somewhat defeat the purpose.
I've tried the following but to no avail:
with Path("moby_dick.txt").open() as f:
    moby_dick = f.read()
tokenized = tokenize(moby_dick, "en")
whale_freq = word_frequency("whale", "en", wordlist=tokenized)
print("whale_freq:", whale_freq)

with Path("moby_dick.txt").open() as f:
    moby_dick = f.read()
whale_freq = word_frequency("whale", "en", wordlist=moby_dick)
print("whale_freq:", whale_freq)
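For what it's worth, the Counter-based fallback mentioned above would look something like this (the tokens list here is stand-in data for what tokenize(moby_dick, "en") would return):

```python
from collections import Counter

# Stand-in token list; in practice this would be wordfreq.tokenize output.
tokens = ["call", "me", "ishmael", "the", "whale", "the", "whale"]

counts = Counter(tokens)
whale_freq = counts["whale"] / sum(counts.values())
print("whale_freq:", whale_freq)
```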
wordfreq raises warnings because it's using an obsolete parameter to msgpack, encoding='utf-8'. It should be updated to raw=False.
Hi,
Would it be possible to publish an updated version of wordfreq on Zenodo, to enable referencing the latest version?
Currently latest version on Zenodo is v2.2.
Thank you for this great package 🥇
I'm interested in using wordfreq in a project I'm doing with spaCy; unfortunately, both wordfreq and spaCy require different versions of regex at an exact version number. Would it be possible to loosen the requirements on the regex version needed?
Is it possible to get a character list for each language ordered by frequency?
Going through the top_n_list in several languages shows numbers such as 00 or 0000 as extremely frequent.
Is this expected behavior? These numbers are clearly not as common as the library says they are. Is there a way to remove these numbers from the function's return value so they don't skew the result?
>>> top_n_list('en', 30)
['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for', 'you', 'it', 'on', '00', 'with', 'was', 'be', 'this', 'as', 'are', 'not', 'have', 'at', 'he', 'by', 'from', 'but', '0000', 'my', 'an']
>>> top_n_list('es', 30)
['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se', 'un', 'por', 'del', 'es', 'las', 'con', 'una', 'para', 'lo', 'su', 'al', '00', 'como', 'me', 'más', 'si', 'pero', 'o', '0000', 'te']
>>> top_n_list('it', 30)
['di', 'e', 'che', 'il', 'la', 'in', 'a', 'non', 'un', 'per', 'è', 'del', 'l', 'i', 'una', 'le', 'si', 'della', 'con', 'da', '00', 'sono', 'ma', 'al', 'come', 'ha', 'più', 'dei', 'se', 'nel']
Would be wonderful to be able to get frequencies for the different sources!
I tried with pip3 and pip as well. Python is set up correctly.
$ pip3 install wordfreq[cjk]
zsh: no matches found: wordfreq[cjk]
mecab-python3 itself doesn't recommend ipadic anymore.
In order to use MeCab, you must install a dictionary. There are many different dictionaries available for MeCab. These UniDic packages, which include slight modifications for ease of use, are recommended:
- unidic: The latest full UniDic.
- unidic-lite: A slightly modified UniDic 2.1.2, chosen for its small size.
The dictionaries below are not recommended due to being unmaintained for many years, but they are available for use with legacy applications.
For more details on the differences between dictionaries see here.
Furthermore, other tokenizers might also be considered (though that's a little out of scope, and could create more confusion, perhaps).
/wordfreq/numbers.py", line 99, in
def _sub_zeroes(match: regex.Match) -> str:
AttributeError: module 'regex' has no attribute 'Match'
Hi there, I got this error, but I solved it by going into the file "numbers.py" and changing
import regex
into:
import re as regex
Did I do the right thing?
There is a project under the Unicode License (a permissive license) called Unilex that is gathering frequency data for 1,000 languages. It is based on Google's corpuscrawler, which is written in Python and hand-fed links to WordPress sites, Bible translations, etc. Both projects are on GitHub.
I am very grateful for the current minimum argument: it has helped me on many occasions. However, I am wondering if it would be possible for minimum to also accept a function/lambda, since this would simplify a lot of things? For instance, something like:
zipf_frequency("dog", "en", minimum=lambda word: math.log10(len(word)))
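In the meantime, a thin wrapper can emulate a callable minimum on top of the existing numeric parameter (zipf_frequency is stubbed out below so the sketch runs standalone; the wrapper name is invented):

```python
import math

def zipf_frequency_stub(word, lang):
    # Placeholder for wordfreq.zipf_frequency, so the example is self-contained.
    return 4.2

def zipf_with_callable_minimum(word, lang, minimum=0.0):
    # Accept either a number or a function of the word for the floor value.
    floor = minimum(word) if callable(minimum) else minimum
    return max(zipf_frequency_stub(word, lang), floor)

print(zipf_with_callable_minimum("dog", "en",
                                 minimum=lambda w: math.log10(len(w))))
```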