
nltk_contrib's Introduction

nltk_contrib's People

Contributors

abosamoor, alexrudnick, bitdancer, bmaland, brendanwood, ccrowner, felipe-dachshund, felixonmars, h4ck3rm1k3, jfrazee, kmike, robyoung, rwsproat, stevenbird, xim


nltk_contrib's Issues

Found a bug in textgrid.py

Hello,

I found a bug in the function to_oo() in textgrid.py.

def to_oo(self):
    """
    @return: A string in OoTextGrid file format.
    """
    oo_file = ""
    oo_file += "File type = \"ooTextFile\"\n"
    oo_file += "Object class = \"TextGrid\"\n\n"
    oo_file += "xmin = ", self.xmin, "\n"
    oo_file += "xmax = ", self.xmax, "\n"
    oo_file += "tiers? <exists>\n"
    oo_file += "size = ", self.size, "\n"
    oo_file += "item []:\n"

TypeError: cannot concatenate 'str' and 'tuple' objects

Could it be written as oo_file += "xmin = " + str(self.xmin) + "\n", and so on? (str() is needed if xmin is a number.)
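The suggested fix can be sketched as follows. This is a hypothetical minimal version of the method: the TextGrid class here is a stand-in with only the attributes to_oo() touches, not the real class from textgrid.py.

```python
class TextGrid(object):
    """Hypothetical stand-in exposing only the attributes to_oo() uses."""

    def __init__(self, xmin, xmax, size):
        self.xmin = xmin
        self.xmax = xmax
        self.size = size

    def to_oo(self):
        """Return a string in ooTextFile format."""
        oo_file = ""
        oo_file += "File type = \"ooTextFile\"\n"
        oo_file += "Object class = \"TextGrid\"\n\n"
        # The original used `oo_file += "xmin = ", self.xmin, "\n"`, which
        # builds a tuple on the right-hand side; a str cannot be
        # concatenated with a tuple, hence the TypeError. Explicit str()
        # conversion and `+` concatenation avoids that.
        oo_file += "xmin = " + str(self.xmin) + "\n"
        oo_file += "xmax = " + str(self.xmax) + "\n"
        oo_file += "tiers? <exists>\n"
        oo_file += "size = " + str(self.size) + "\n"
        oo_file += "item []:\n"
        return oo_file
```

With this change, `TextGrid(0.0, 2.5, 3).to_oo()` returns the header string instead of raising.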

langid demo doesn't seem to work anymore

It appears the langid demo in the misc package isn't up to date. It relies on the nltk.detect module, which no longer seems to exist. Moreover, it calls langs(...) on the udhr corpus reader, which doesn't seem to exist either.

Would there be a chance to see it updated?

Thanks

Passing Unicode directly raises TypeError in textanalyzer

Perhaps I'm doing something wrong, but it's worth checking.

My input to the ReadabilityTool is Unicode text (already decoded from UTF-8), and I receive a TypeError when trying to run the tests on it.

Traceback (most recent call last):
  File "/Users/uname/projects/news_genome/news_genome/features.py", line 137, in metrics
    flesch_readability(story),
  File "/Users/uname/projects/news_genome/news_genome/mlstripper.py", line 23, in wrapper
    return fn(text,*args,**kwargs)
  File "/Users/uname/projects/news_genome/news_genome/mlstripper.py", line 30, in wrapper
    ret = fn(*args,**kwargs)
  File "/Users/uname/projects/news_genome/news_genome/features.py", line 49, in flesch_readability
    contrib_score = rt.FleschReadingEase(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/readabilitytests.py", line 87, in FleschReadingEase
    self.__analyzeText(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/readabilitytests.py", line 49, in __analyzeText
    words = t.getWords(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/textanalyzer.py", line 50, in getWords
    text = self._setEncoding(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/textanalyzer.py", line 130, in _setEncoding
    text = unicode(text, "utf8").encode("utf8")
TypeError: decoding Unicode is not supported

It appears the logic at line 130 in textanalyzer.py attempts a decoding step that has already been performed.

def _setEncoding(self,text):
        try:
            text = unicode(text, "utf8").encode("utf8")
        except UnicodeError:
            try:
                text = unicode(text, "iso8859_1").encode("utf8")
            except UnicodeError:
                text = unicode(text, "ascii", "replace").encode("utf8")
        return text

Is there something I need to configure in order to make the module expect Unicode by default?
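One possible fix is to only decode when the input is still a byte string and pass already-decoded text through unchanged. The sketch below is hypothetical (the function name and its being a free function are illustrative); it is written in Python 3 terms (bytes vs. str), which in Python 2 corresponds to guarding with isinstance(text, unicode) before calling unicode(text, "utf8").

```python
def set_encoding(data):
    """Decode bytes to text, trying UTF-8, then Latin-1, then ASCII.

    If `data` is already text, return it unchanged -- decoding an
    already-decoded string is exactly what raises the TypeError in the
    traceback above.
    """
    if isinstance(data, bytes):
        try:
            return data.decode("utf8")
        except UnicodeError:
            try:
                return data.decode("iso8859_1")
            except UnicodeError:
                return data.decode("ascii", "replace")
    return data  # already text; pass through
```

Both `set_encoding(b"caf\xc3\xa9")` and `set_encoding(u"caf\u00e9")` then yield the same text value.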

Why doesn't it recognize explicit dates

Hello, I have the following sentence:
"See you in July 18th, 2016".
When I use your "tag" function, the output is unchanged:
"See you in July 18th, 2016"
I think it should tag 'July 18th'. Is there a way to include it?
Also, weekday cannot be identified, for example:
"See you on Monday" --> Monday is not recognized.
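The two missed cases could be covered with patterns along these lines. This is only a sketch: the real tagger maintains its own pattern set, and tag_dates, MONTHS, and WEEKDAYS here are illustrative names, not part of the library. The <TIMEX2>...</TIMEX2> wrapping mirrors the style of temporal-expression taggers.

```python
import re

# Hypothetical extra patterns for the two cases reported above.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"

# "July 18th, 2016" style: month, day with optional ordinal suffix,
# optional ", YYYY".
explicit_date = re.compile(
    r"\b(%s) \d{1,2}(st|nd|rd|th)?(, \d{4})?\b" % MONTHS)
weekday = re.compile(r"\b(%s)\b" % WEEKDAYS)

def tag_dates(text):
    """Wrap matched expressions in <TIMEX2> tags."""
    text = explicit_date.sub(
        lambda m: "<TIMEX2>%s</TIMEX2>" % m.group(0), text)
    text = weekday.sub(
        lambda m: "<TIMEX2>%s</TIMEX2>" % m.group(0), text)
    return text
```

For example, `tag_dates("See you in July 18th, 2016")` yields `"See you in <TIMEX2>July 18th, 2016</TIMEX2>"`, and `"See you on Monday"` gets its weekday tagged.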

Enhanced version of Bioreader in nltk_contrib

(migrated from nltk/nltk#149)


Hi
An enhanced version of bioreader in the nltk_contrib [http://code.google.com/p/nltk/source/browse/trunk#trunk%2Fnltk_contrib%2Fnltk_contrib%2Fbioreader] directory is available at https://bitbucket.org/jagan/bioreader.
Code cleanup and implementation of coding standards have been done.

Jaganadh G

Migrated from http://code.google.com/p/nltk/issues/detail?id=661

earlier comments
StevenBird1 said, at 2011-04-08T13:26:40.000Z:

Thanks. Would you please describe what the extra files are for? Also, please remember to use "new style" Python classes.

jaganadhg said, at 2011-04-08T14:43:05.000Z:

Dear Steven, the extra files are programs which I used for testing. I have now removed them from Bitbucket. I will implement the "new style" Python classes soon; if possible I will finish by this weekend.

jaganadhg said, at 2011-04-08T15:33:53.000Z:

Dear Steven, I have just incorporated the "new style" Python classes and also made some minor corrections to the API documentation. Jaganadh G

Release nltk_contrib on PyPI

I find it unnecessarily difficult to install the nltk_contrib package, as it is not published on PyPI. I know it can still be installed with pip, but I want to list nltk_contrib in the dependency list of my setup.py file.

Please consider pushing a release to PyPI.
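Until such a release exists, one workaround (a sketch, assuming the canonical GitHub repository) is to point pip at the repository directly in a requirements file:

```
# requirements.txt -- hypothetical pin until a PyPI release exists
git+https://github.com/nltk/nltk_contrib.git#egg=nltk_contrib
```

This still doesn't help with declaring the dependency in setup.py for distribution, which is why a proper PyPI release would be preferable.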

MUC 7 Corpus reader Crashes

>>> r = LazyCorpusLoader('muc_7/', MUCCorpusReader, 'data/..ne.eng.keys.')
>>> r.iob_sents()
[[('Like', 'O'), ('most', 'O'), ('of', 'O'), ('the', 'O'), ('two', 'O'), ('million', 'O'), ('infants', 'O'), ('under', 'O'), ('2', 'O'), ('who', 'O'), ('fly', 'O'), ('with', 'O'), ('their', 'O'), ('parents', 'O'), ('every', 'O'), ('year', 'O'), (',', 'O'), ('Danasia', 'B-PERSON'), ('was', 'O'), ('traveling', 'O'), ('for', 'O'), ('free', 'O'), (',', 'O'), ('seated', 'O'), ('on', 'O'), ('her', 'O'), ('mother', 'O'), ("'s", 'O'), ('lap', 'O'), ('.', 'O')], [('As', 'O'), ('the', 'O'), ('DC-9', 'O'), ('approached', 'O'), ('the', 'O'), ('airport', 'O'), ('on', 'O'), ('July', 'B-DATE'), ('2', 'I-DATE'), (',', 'I-DATE'), ('1994', 'I-DATE'), (',', 'O'), ('wind', 'O'), ('shear', 'O'), ('slammed', 'O'), ('the', 'O'), ('plane', 'O'), ('to', 'O'), ('the', 'O'), ('ground', 'O'), ('.', 'O')], ...]
>>> len(r.iob_sents())
[Tree('S', ['The', Tree('ORGANIZATION', ['Unicef']), 'Flyer', 'flight', 'suffered', 'a', 'setback', Tree('DATE', ['Dec', '.'])]), Tree('DATE', ['Dec', '.'])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in __len__
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in <genexpr>
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 807, in __len__
    if len(self._offsets) <= len(self._list):
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in __len__
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in <genexpr>
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 807, in __len__
    if len(self._offsets) <= len(self._list):
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in __len__
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in <genexpr>
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/corpus/reader/util.py", line 379, in __len__
    for tok in self.iterate_from(self._offsets[-1]): pass
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/corpus/reader/util.py", line 401, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/corpus/reader/util.py", line 298, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 419, in _read_parsed_block
    return map(self._parse, self._read_block(stream))
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 428, in _parse
    tree = mucstr2tree(doc, top_node='DOC')
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 468, in mucstr2tree
    'text': _muc_read_text(match.group('text'), top_node),
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 534, in _muc_read_text
    tree[-1].append(_muc_read_words(sent, 'S'))
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 558, in _muc_read_words
    assert len(stack) == 1
AssertionError
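The failing invariant can be illustrated with a sketch. A tree builder like _muc_read_words typically keeps an explicit stack: each opening tag pushes a node, each closing tag pops one, so a well-formed sentence leaves exactly one element (the root) on the stack. The function below is hypothetical (not the actual muc.py code) and only demonstrates why unbalanced markup in a source file would trip `assert len(stack) == 1`.

```python
def check_balance(tokens):
    """Return True if open/close tags in `tokens` are balanced.

    Mimics the stack discipline of a tag-driven tree builder: an
    unmatched opening tag leaves an extra frame on the stack, which is
    the condition the assertion in _muc_read_words guards against.
    """
    stack = [[]]  # root node
    for tok in tokens:
        if tok.startswith("</"):
            stack.pop()            # closing tag: pop current node
        elif tok.startswith("<"):
            node = []              # opening tag: push a child node
            stack[-1].append(node)
            stack.append(node)
        else:
            stack[-1].append(tok)  # plain word
    return len(stack) == 1
```

So `["<ENAMEX>", "Unicef", "</ENAMEX>"]` balances, while `["<ENAMEX>", "Unicef"]` does not; an unclosed tag somewhere in the MUC-7 key file would produce exactly this kind of AssertionError.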

Rewritten readability tests from nltk_contrib

Hi,

I wanted to use the readability tests from the nltk_contrib package. The code did not meet my requirements (e.g. choosing between Dutch and English), so I rewrote it.

I put it there so others can use the work as well. Maybe it could become part of the official NLTK package? I would like to edit it so that it meets NLTK's coding standards.

Migrated from http://code.google.com/p/nltk/issues/detail?id=677

earlier comments
alex.rudnick said, at 2011-05-24T06:34:59.000Z:

Thanks for submitting updates! Could you outline the changes you made? Improvements are always welcome! Are you interested in contributing this as part of nltk_contrib? (could it maybe be merged in to the existing readability package, or would you prefer it to be separate?)

izidor.matusov said, at 2011-05-24T06:58:19.000Z:

I made these changes:
1. The original readability tests module of nltk_contrib contains code which is not related to readability tests. That code was removed.
2. Repaired a few lines so the code actually runs in the current version of Python.
3. Removed support for Dutch.
4. Polished the interface.
5. Rewrote the code so pylint does not nag so much.
Yes, I am interested. It could be merged, but there are many radical changes, so it could be a problem for anyone who uses the previous readability package.


Migrated from nltk/nltk#161

readabilitytests problem with utf-8 characters

I ran into a problem trying to apply the readability tests to a block of text with some UTF-8 characters (fancy quotes).

Sample text: http://pastebin.com/eRKGMGYn

Test script: http://pastebin.com/aE2DaRvk

I'm not very familiar with nltk_contrib, so perhaps I'm just using it wrong, but it seems to fail regardless of whether I pass a bytestring or a unicode string to ReadabilityTool. I forked nltk_contrib and changed textanalyzer.py so that it accepts unicode instead of bytes, which seems to have fixed the problem for me.

My fork: https://github.com/priceonomics/nltk_contrib

Can someone confirm the issue I'm seeing and whether my fix is appropriate? Feel free to merge it back if it's useful.
