
nlpub / pymystem3


A Python wrapper of the Yandex Mystem 3.1 morphological analyzer (http://api.yandex.ru/mystem). The original tool is shipped as a binary, and this library makes it easy to integrate into Python projects. Let us know in the issues if you would like to be involved in the development or maintenance of this project. If you have a fix or suggestion, please make a pull request. We are very open to accepting contributions.

Home Page: https://nlpub.ru/Mystem

License: Other

Python 100.00%
pos tagging tagger lemma lemmatization lemmatizer russian language morphology mystem

pymystem3's People

Contributors

afefelov, alexanderpanchenko, crazyministr, daskol, deni64k, hiyorimi, pantherka, smoke-b, vasinkd, x0wllaar


pymystem3's Issues

Pypi upload doesn't match the code at this repository

Hello. It turns out that this repo is missing some changes that were made to the same version uploaded to pypi.python.org. One thing I have found is Python 3 compatibility: unicode is replaced with str. Could you please provide these changes, or am I missing something?

Offline installation missing

In my organisation I have to deal with a firewall, and some URLs are simply blocked. In this case the first use of pymystem3 looks like this:
Microsoft Windows [Version 6.1.7601]
C:\Users\qtros>ipython
Python 3.5.4 |Anaconda custom (64-bit)| (default, Aug 14 2017, 13:41:13) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from pymystem3 import Mystem
In [2]: m = Mystem()
Installing mystem to C:\Users\qtros/.local/bin\mystem.exe from http://download.cdn.yandex.net/mystem/mystem-3.0-win7-64bit.zip
---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
<ipython-input-2-489909c69c84> in <module>()
----> 1 m = Mystem()
C:\Program Files\Anaconda3\lib\site-packages\pymystem3\mystem.py in __init__(self, mystem_bin, grammar_info, disambiguation, entire_input)
    177
    178         if self._mystem_bin is None:
--> 179             autoinstall()
    180             self._mystem_bin = MYSTEM_BIN
    181
C:\Program Files\Anaconda3\lib\site-packages\pymystem3\mystem.py in autoinstall(out)
     59     if os.path.isfile(MYSTEM_BIN):
     60         return
---> 61     install(out)
     62
     63
C:\Program Files\Anaconda3\lib\site-packages\pymystem3\mystem.py in install(out)
     95         elif url.endswith('.zip'):
     96             import zipfile
---> 97             zip = zipfile.ZipFile(tmp_path)
     98             try:
     99                 zip.extractall(MYSTEM_DIR)
C:\Program Files\Anaconda3\lib\zipfile.py in __init__(self, file, mode, compression, allowZip64)
   1024         try:
   1025             if mode == 'r':
-> 1026                 self._RealGetContents()
   1027             elif mode in ('w', 'x'):
   1028                 # set the modified flag so central directory gets written
C:\Program Files\Anaconda3\lib\zipfile.py in _RealGetContents(self)
   1091             raise BadZipFile("File is not a zip file")
   1092         if not endrec:
-> 1093             raise BadZipFile("File is not a zip file")
   1094         if self.debug > 1:
   1095             print(endrec)
BadZipFile: File is not a zip file
In [3]:
That's because the network security tools replaced the actual zip content with some kind of error page ('You are prohibited to do this, bla bla bla'). How can I install in offline mode?
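A possible workaround, as a sketch only: the traceback above shows that Mystem.__init__ takes a mystem_bin argument and calls autoinstall() only when it is None. A binary downloaded on another machine and copied over by hand could therefore be passed explicitly. The helper names find_local_mystem and make_offline_mystem, and any candidate paths, are hypothetical.

```python
import os


def find_local_mystem(candidates):
    """Return the first candidate path that is an existing file, else None.

    `candidates` is a hypothetical list of places where a manually
    copied mystem binary might live.
    """
    for path in candidates:
        expanded = os.path.expanduser(path)
        if os.path.isfile(expanded):
            return expanded
    return None


def make_offline_mystem(candidates):
    """Build a Mystem object from a local binary without touching the network."""
    binary = find_local_mystem(candidates)
    if binary is None:
        raise FileNotFoundError("no local mystem binary among: %r" % (candidates,))
    # Imported lazily so find_local_mystem() works without pymystem3 installed.
    from pymystem3 import Mystem

    # Passing mystem_bin skips autoinstall(), per the traceback above.
    return Mystem(mystem_bin=binary)
```

For example, make_offline_mystem(["~/.local/bin/mystem"]) would use a binary that was placed there by hand, assuming it matches the platform.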

mystem.lemmatize() fails when run in a Celery worker

pymystem3 = "^0.2.0"

Steps to reproduce:

  1. Run mystem.lemmatize() on a string longer than 65535 chars in a Celery worker.
  2. Get the exception:
     (yexception) util/charset/wide.h:295: failed to decode UTF-8 string at pos 65535
     Aborted.

The bug is reproducible. However, in a regular (non-Celery) runtime the same call runs smoothly.
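The root cause is not established in this report. One way to sidestep it, sketched under the assumption that shorter inputs are safe, is to split the text on whitespace so that no chunk approaches the 65535 position mentioned in the error. The function name split_by_utf8_size and the 60000-byte default are hypothetical choices.

```python
def split_by_utf8_size(text, max_bytes=60000):
    """Split text on whitespace into chunks whose UTF-8 size is <= max_bytes.

    A single word longer than max_bytes is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        # One extra byte for the joining space when the chunk is non-empty.
        extra = word_size + (1 if current else 0)
        if current and size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, size = [word], word_size
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be passed to mystem.lemmatize() separately and the resulting lists concatenated. Note that text.split() collapses runs of whitespace and newlines, so whitespace tokens in the output may differ from a single call on the full string.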

Slow lemmatization on Windows

This code in Jupyter notebook:

%%time
from pymystem3 import Mystem
m = Mystem()
text = "Красивая мама красиво мыла раму"
for i in range(10):
    lemmas = m.lemmatize(text)

Outputs Wall time: 9.27 s
Every iteration takes almost 1 second. This is dramatically different from the results I observe on Ubuntu.
My configuration: Windows 7 x64, i7 4770 3.6 GHz, 16 GB RAM

Pipeline on Windows

Can someone please explain why the pipeline mode is disabled on non-POSIX platforms? Enabling it doesn't give me any errors and at least seems to provide a correct output. I can't rigorously check the output for correctness at the moment, but even if I'm wrong, I think it's worthwhile to add some documentation since disabling pipeline makes mystem work on multiple inputs slower by ~3 orders of magnitude.

TypeError: a bytes-like object is required, not 'NoneType' in _analyze_impl

Hi. I have a question.

I've got a bunch of large .csv files and parse them via a multiprocessing Pool:

res = [pool.apply_async(func=func, args=(file,)) for file in files]

When I call lemmatize(ru_text), I get an exception at:
https://github.com/nlpub/pymystem3/blob/master/pymystem3/mystem.py#L370

sio.write(out)
TypeError: a bytes-like object is required, not 'NoneType'

I don't understand how https://github.com/nlpub/pymystem3/blob/master/pymystem3/mystem.py#L369 can read None. Is it because stdout is non-blocking?

out = self._procout.read()
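For context, Python's raw (unbuffered) streams do return None from read() when the descriptor is non-blocking and no bytes are ready. The POSIX-only sketch below demonstrates this with a plain pipe standing in for mystem's stdout. Whether the library's pipe is actually in that state here is the reporter's guess, and read_available is a hypothetical helper.

```python
import os
import select


def read_available(raw, fd, timeout=1.0):
    """Wait until fd is readable (or timeout expires), then read; never return None."""
    select.select([fd], [], [], timeout)
    data = raw.read()
    # A non-blocking raw stream returns None when no bytes are ready.
    return data if data is not None else b""


# A plain pipe stands in for the child process's stdout.
read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)
raw = os.fdopen(read_fd, "rb", buffering=0)

empty_read = raw.read()  # nothing written yet, so this is None
os.write(write_fd, b"{}\n")
filled_read = read_available(raw, read_fd)  # b"{}\n"

os.close(write_fd)
raw.close()
```

Guarding the result, as read_available does, avoids passing None to sio.write().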

Troubles with pymystem3 in Google Colab

I am trying to install the library in Google Colab. If you install it from GitHub, it works but does not return anything; if you install it via pip, it just freezes.

processes of mystem still alive

I call .lemmatize() from many threads (threading.Thread).
Some "..bin/mystem --format json -gi -d -c" processes are still alive after my threads have finished.
See the attached screenshot (screenshot at 00-54-08).
Is there a way to terminate these processes? They consume a large amount of memory after a few days of work.
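One pattern that may help, as a sketch: create the analyzer inside each thread and close it before the thread ends. It relies on the close() method that appears elsewhere on this page. The function names and the make_analyzer parameter are hypothetical; in real use make_analyzer would be the Mystem class itself.

```python
import threading
from contextlib import closing


def lemmatize_in_thread(make_analyzer, text, results, index):
    """Create an analyzer, use it, and always close it before the thread ends."""
    # closing() calls analyzer.close() even if lemmatize() raises.
    with closing(make_analyzer()) as analyzer:
        results[index] = analyzer.lemmatize(text)


def lemmatize_all(make_analyzer, texts):
    """Lemmatize each text in its own thread and return results in input order."""
    results = [None] * len(texts)
    threads = [
        threading.Thread(target=lemmatize_in_thread, args=(make_analyzer, t, results, i))
        for i, t in enumerate(texts)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

With pymystem3 installed, the call would look like lemmatize_all(Mystem, texts). Whether close() reliably terminates the child process in every version is not confirmed here.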

PermissionError: [Errno 13] Permission denied

When I call m = Mystem(), it gives me a "permission denied" error.
Installing mystem to /Users/magzhankairanbay/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.1-macosx.tar.gz
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/Users/magzhankairanbay/anaconda3/lib/python3.6/site-packages/pymystem3/mystem.py", line 178, in __init__
    autoinstall()
  File "/Users/magzhankairanbay/anaconda3/lib/python3.6/site-packages/pymystem3/mystem.py", line 57, in autoinstall
    install(out)
  File "/Users/magzhankairanbay/anaconda3/lib/python3.6/site-packages/pymystem3/mystem.py", line 88, in install
    tar.extract(MYSTEM_EXE, MYSTEM_DIR)
  File "/Users/magzhankairanbay/anaconda3/lib/python3.6/tarfile.py", line 2052, in extract
    numeric_owner=numeric_owner)
  File "/Users/magzhankairanbay/anaconda3/lib/python3.6/tarfile.py", line 2122, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/Users/magzhankairanbay/anaconda3/lib/python3.6/tarfile.py", line 2163, in makefile
    with bltn_open(targetpath, "wb") as target:
PermissionError: [Errno 13] Permission denied: '/Users/magzhankairanbay/.local/bin/mystem'

I am using a Mac.
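A quick diagnostic, as a sketch: the failing call opens ~/.local/bin/mystem for writing, so it may help to check which existing path on the way to that target is not writable by the current user. The helper name first_unwritable is hypothetical.

```python
import os


def first_unwritable(path):
    """Find the nearest existing path at or above `path` and report it if not writable.

    Returns None when that existing path is writable by the current user.
    """
    current = os.path.abspath(os.path.expanduser(path))
    while not os.path.exists(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return None if os.access(current, os.W_OK) else current
```

Calling first_unwritable("~/.local/bin/mystem") returns the offending file or directory, if any. A common cause of such a result is a directory or file previously created by another user (for example via sudo), though that is only a guess for this report.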

Lemmatization errors in verbs

Here comes a (non-exhaustive) list of verb forms lemmatized incorrectly by Mystem(). The list was compiled as part of GramEval2020 evaluation survey of baseline tools. Based on UD-SynTagRus v.2.5.
FORM - verb form
LEMMA_MANUAL - lemma assigned by expert
LEMMA_UDPIPE - lemma given in UD-SynTagRus (and thus assigned by udpipe model)
LEMMA_MYSTEM3 - lemma assigned by pymystem3

lemmas_wrong_choice.txt

ResourceWarning raising

If one calls resetwarnings(), starts mystem and lemmatizes something, then Python emits a ResourceWarning because of unclosed IO pipes, like the following.

/lib/python3.6/site-packages/pymystem3/mystem.py:210: ResourceWarning: unclosed file <_io.FileIO name=5 mode='wb' closefd=True>
  self._proc = None
/lib/python3.6/site-packages/pymystem3/mystem.py:210: ResourceWarning: unclosed file <_io.FileIO name=6 mode='rb' closefd=True>

Your system is not supported.

I tried to run the sample code to try out pymystem3.
from pymystem3 import Mystem
text = "Красивая мама красиво мыла раму"
m = Mystem()
lemmas = m.lemmatize(text)
print(''.join(lemmas))
The module was installed successfully (version 0.2.0), but this error is printed:

NotImplementedError: Your system is not supported. Feel free to report bug or make a pull request.

What's going wrong? How can I fix it?

BrokenPipeError in long-running process

It seems that if the mystem process dies for some reason (e.g. the OS kills it), the wrapper crashes with a BrokenPipeError. I think such behavior is unwanted, but I am not sure it can be fixed easily. I worked around it in my long-running process by catching the BrokenPipeError on the Mystem.lemmatize call and recreating the Mystem object when needed.

The traceback:

  File "/usr/local/lib/python3.5/dist-packages/pymystem3/mystem.py", line 264, in lemmatize
    infos = self.analyze(text)
  File "/usr/local/lib/python3.5/dist-packages/pymystem3/mystem.py", line 249, in analyze
    result.extend(self._analyze_impl(line))
  File "/usr/local/lib/python3.5/dist-packages/pymystem3/mystem.py", line 280, in _analyze_impl
    self._procin.write(text)
BrokenPipeError: [Errno 32] Broken pipe

P.S. I have also noticed similar behavior in an interactive Jupyter Notebook scenario. If you interrupt a cell that uses a Mystem object, later calls to its lemmatize method lead to a BrokenPipeError, so you need to recreate the object.
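The workaround described above (catch BrokenPipeError, recreate the object, retry) could look roughly like this. The class name ResilientLemmatizer and the make_analyzer parameter are hypothetical; in real use make_analyzer would be the Mystem class or a function returning a configured instance.

```python
class ResilientLemmatizer:
    """Wrap an analyzer and rebuild it once when its pipe is broken."""

    def __init__(self, make_analyzer):
        self._make_analyzer = make_analyzer
        self._analyzer = make_analyzer()

    def lemmatize(self, text):
        try:
            return self._analyzer.lemmatize(text)
        except BrokenPipeError:
            # The child process is gone; start a fresh one and retry once.
            self._analyzer = self._make_analyzer()
            return self._analyzer.lemmatize(text)
```

A second failure on the retry is deliberately not caught, so a persistent problem still surfaces instead of looping forever.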

Slow lemmatization on Windows

Well, I found a hacky solution that improves performance when lemmatizing long texts. I added a comment in #14, but I think it's better to create a new issue for this.

In mystem.py we have this function:

    def analyze(self, text):
        """
        Make morphology analysis for a text.
        :param  text:   text to analyze
        :type   text:   str
        :returns:       result of morphology analysis.
        :rtype:         dict
        """

        result = []
        for line in text.splitlines():
            try:
                result.extend(self._analyze_impl(line))
            except broken_pipe:
                self.close()
                self.start()
                result.extend(self._analyze_impl(line))
        return result

We can change it for the case where _PIPELINE_MODE is off:

if not _PIPELINE_MODE:
    def analyze(self, text):
        """
        Make morphology analysis for a text.
        :param  text:   text to analyze
        :type   text:   str
        :returns:       result of morphology analysis.
        :rtype:         dict
        """

        result = []
        span = 2000
        lines = text.splitlines()
        lines = [" ".join(lines[i:i+span]) for i in range(0, len(lines), span)]

        for line in lines:
            try:
                result.extend(self._analyze_impl(line))
            except broken_pipe:
                self.close()
                self.start()
                result.extend(self._analyze_impl(line))
        return result

These changes dramatically increase performance on Windows. Unfortunately, I don't know whether it is fine to send such long "lines" to Mystem, but it works for me.
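To make the batching step easier to inspect on its own, here it is as a standalone function. This is only an illustration of the two list-building lines above; the name batch_lines is hypothetical.

```python
def batch_lines(text, span=2000):
    """Join every `span` consecutive lines of text into one space-separated line."""
    lines = text.splitlines()
    return [" ".join(lines[i:i + span]) for i in range(0, len(lines), span)]
```

With span=2, the text "a\nb\nc\nd\ne" becomes ["a b", "c d", "e"], so three calls reach mystem instead of five. Because lines are joined with spaces, the original line boundaries are not preserved in the analysis result.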

Windows + Python 3.4

This error occurs when trying to run the standard example from the documentation:

C:\Python34>python.exe Scripts\test.py> test_py.out
Traceback (most recent call last):
  File "Scripts\test.py", line 4, in <module>
    lemmas = m.lemmatize(text)
  File "C:\Python34\lib\site-packages\pymystem3\mystem.py", line 235, in lemmatize
    infos = self.analyze(text)
  File "C:\Python34\lib\site-packages\pymystem3\mystem.py", line 220, in analyze
    result.extend(self._analyze_impl(line))
  File "C:\Python34\lib\site-packages\pymystem3\mystem.py", line 285, in _analyze_impl
    obj = json.loads(out)
  File "C:\Python34\lib\json\__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
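The error arises because json.loads() in Python 3.4 accepts only str (bytes input is accepted from Python 3.6 on). A version-independent way to parse the process output is to decode it first, as in this sketch. The helper name loads_any is hypothetical, and UTF-8 is an assumption about mystem's output encoding.

```python
import json


def loads_any(out, encoding="utf-8"):
    """Parse JSON from either str or bytes by decoding bytes first."""
    if isinstance(out, (bytes, bytearray)):
        out = out.decode(encoding)
    return json.loads(out)
```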

Apple Silicon support

Hi. I want to use this morphological analyzer in my project, but I get an error.

Python 3.7.3 (default, Jun 19 2019, 01:54:03) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pymystem3 import Mystem
>>> m = Mystem(
... )
Installing mystem to /root/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.1-linux-64bit.tar.gz
>>> text = "Красивая"
>>> lemma = m.lemmatize(text)
qemu-x86_64: Could not open '/lib64/ld-linux-x86-64.so.2': No such file or directory

Can this be fixed?

My environment:
macbook air m1
pymystem3 installed from github
python 3.7 in docker container

Thanks for your attention.

Error when launch in docker

Hello.
I am trying to run mystem in Docker with multiprocessing and get this error:
OSError: [Errno 26] Text file busy: '/root/.local/bin/mystem'

Dockerfile
FROM python:3.8-buster
WORKDIR /usr/src/app
COPY . .
RUN python3.8 -m pip install --upgrade pip
RUN python3.8 -m pip install --no-cache-dir pymystem3
CMD ["python3.8", "script.py"]

script.py
import multiprocessing
import time
from pymystem3 import Mystem


def lemma(data):
    process_n = data[0]
    texts = data[1]
    print("process_n", process_n, "created Mystem()")
    m = Mystem()
    lemma_text = m.lemmatize(' '.join(texts))
    m.close()
    print("process_n", process_n, "closed Mystem()")
    return lemma_text


def lemmatisation_text():
    n_core = 4
    texts = ['Мама моет раму, Рама держит маму.' for x in range(1000)]
    # Attach process numbers
    params = [[core, texts] for core in range(n_core)]
    print('pool start')
    pool = multiprocessing.Pool(n_core)
    print('pool map')
    res = pool.map(lemma, params)
    print('pool close')
    pool.close()
    print('pool join')
    pool.join()
    print(len(res))


if __name__ == '__main__':
    start = time.time()
    lemmatisation_text()
    print(time.time() - start)

Memory leakage

There is no proper destructor. As a consequence, memory leaks when a Mystem object is destroyed while the mystem process is still alive. The process is only terminated when the parent application exits, and that cleanup is done by Linux.

Try to run the following.

mystem = Mystem().start()
del mystem

The GC collects the mystem object, but the mystem process is still running.
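Until the library handles this itself, one option is a thin wrapper that registers cleanup with weakref.finalize, so that close() runs when the wrapper is garbage-collected or at interpreter exit. This is a sketch, not the library's mechanism. The class name ManagedAnalyzer is hypothetical, and it assumes the wrapped object exposes close() and lemmatize() as seen elsewhere on this page.

```python
import weakref


class ManagedAnalyzer:
    """Close the wrapped analyzer when this wrapper is collected."""

    def __init__(self, analyzer):
        self._analyzer = analyzer
        # The callback must not reference self, so bind the inner object's close.
        self._finalizer = weakref.finalize(self, analyzer.close)

    def lemmatize(self, text):
        return self._analyzer.lemmatize(text)

    def close(self):
        # Safe to call repeatedly: a finalizer runs its callback at most once.
        self._finalizer()
```

Usage would be along the lines of wrapped = ManagedAnalyzer(Mystem()), after which del wrapped triggers close() on the inner object.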

Enable pipeline mode for PyPy

Right now pipeline mode is disabled for PyPy, but it works at least on PyPy 2.5.1. Can it be turned on, either based on the version number or on feature detection? I can check what was wrong on PyPy 1.9.

\0 causes freeze

If the input string contains \0 character, Mystem does not respond.

Code to reproduce:

from pymystem3 import Mystem
m = Mystem()
m.lemmatize('тест\u0000')

After that, the program freezes. No errors, no warnings. When I send KeyboardInterrupt (^C), I see the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nokados/anaconda3/lib/python3.6/site-packages/pymystem3/mystem.py", line 250, in lemmatize
    infos = self.analyze(text)
  File "/home/nokados/anaconda3/lib/python3.6/site-packages/pymystem3/mystem.py", line 235, in analyze
    result.extend(self._analyze_impl(line))
  File "/home/nokados/anaconda3/lib/python3.6/site-packages/pymystem3/mystem.py", line 273, in _analyze_impl
    select.select([self._procout_no], [], [])
KeyboardInterrupt

Python 3.6.5
pymystem3 0.1.5, 0.1.10, 0.2.0 (all 3 were tested)
Mystem 3.0
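As a caller-side workaround, NUL characters can be removed before the text reaches Mystem. This is a minimal sketch that does not address the underlying hang, and strip_nul is a hypothetical helper name.

```python
def strip_nul(text):
    """Remove NUL characters, which reportedly make mystem stop responding."""
    return text.replace("\x00", "")
```

The reproduction above would then become m.lemmatize(strip_nul('тест\u0000')).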
