
nltk_contrib's Introduction

nltk_contrib's People

Contributors

abosamoor, alexrudnick, bitdancer, bmaland, brendanwood, ccrowner, felipe-dachshund, felixonmars, h4ck3rm1k3, jfrazee, kmike, robyoung, rwsproat, stevenbird, xim


nltk_contrib's Issues

Found a bug in textgrid.py

Hello,

I found a bug in the function to_oo() in textgrid.py.

def to_oo(self):
    """
    @return: A string in OoTextGrid file format.
    """
    oo_file = ""
    oo_file += "File type = \"ooTextFile\"\n"
    oo_file += "Object class = \"TextGrid\"\n\n"
    oo_file += "xmin = ", self.xmin, "\n"
    oo_file += "xmax = ", self.xmax, "\n"
    oo_file += "tiers? <exists>\n"
    oo_file += "size = ", self.size, "\n"
    oo_file += "item []:\n"

TypeError: cannot concatenate 'str' and 'tuple' objects

Could it be written as oo_file += "xmin = " + str(self.xmin) + "\n", and so on? (str() is needed if xmin is a number.)
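The suggested fix can be sketched as follows. This is a hypothetical minimal version of the method: the TextGrid class here is a stand-in with only the attributes to_oo() touches, not the real class from textgrid.py.

```python
class TextGrid(object):
    """Hypothetical stand-in exposing only the attributes to_oo() uses."""

    def __init__(self, xmin, xmax, size):
        self.xmin = xmin
        self.xmax = xmax
        self.size = size

    def to_oo(self):
        """Return a string in ooTextFile format."""
        oo_file = ""
        oo_file += "File type = \"ooTextFile\"\n"
        oo_file += "Object class = \"TextGrid\"\n\n"
        # The original used `oo_file += "xmin = ", self.xmin, "\n"`, which
        # builds a tuple on the right-hand side; a str cannot be
        # concatenated with a tuple, hence the TypeError. Explicit str()
        # conversion and `+` concatenation avoids that.
        oo_file += "xmin = " + str(self.xmin) + "\n"
        oo_file += "xmax = " + str(self.xmax) + "\n"
        oo_file += "tiers? <exists>\n"
        oo_file += "size = " + str(self.size) + "\n"
        oo_file += "item []:\n"
        return oo_file
```

With this change, `TextGrid(0.0, 2.5, 3).to_oo()` returns the header string instead of raising.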

langid demo doesn't seem to work anymore

It appears the langid demo in the misc package isn't up to date. It relies on the nltk.detect module, which no longer seems to exist. Moreover, it calls langs(...) on the udhr corpus reader, which doesn't seem to exist either.

Would there be a chance to see it updated?

Thanks

Passing Unicode directly raises TypeError in textanalyzer

Perhaps I'm doing something wrong, but it's worth checking.

My input to the ReadabilityTool is Unicode text (already decoded from UTF-8), and I receive a TypeError when trying to run the tests on it.

Traceback (most recent call last):
  File "/Users/uname/projects/news_genome/news_genome/features.py", line 137, in metrics
    flesch_readability(story),
  File "/Users/uname/projects/news_genome/news_genome/mlstripper.py", line 23, in wrapper
    return fn(text,*args,**kwargs)
  File "/Users/uname/projects/news_genome/news_genome/mlstripper.py", line 30, in wrapper
    ret = fn(*args,**kwargs)
  File "/Users/uname/projects/news_genome/news_genome/features.py", line 49, in flesch_readability
    contrib_score = rt.FleschReadingEase(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/readabilitytests.py", line 87, in FleschReadingEase
    self.__analyzeText(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/readabilitytests.py", line 49, in __analyzeText
    words = t.getWords(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/textanalyzer.py", line 50, in getWords
    text = self._setEncoding(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/textanalyzer.py", line 130, in _setEncoding
    text = unicode(text, "utf8").encode("utf8")
TypeError: decoding Unicode is not supported

It appears the logic at line 130 in textanalyzer.py attempts a decoding step that has already been performed.

def _setEncoding(self,text):
        try:
            text = unicode(text, "utf8").encode("utf8")
        except UnicodeError:
            try:
                text = unicode(text, "iso8859_1").encode("utf8")
            except UnicodeError:
                text = unicode(text, "ascii", "replace").encode("utf8")
        return text

Is there something I need to configure in order to make the module expect Unicode by default?
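One possible fix is to only decode when the input is still a byte string and pass already-decoded text through unchanged. The sketch below is hypothetical (the function name and its being a free function are illustrative); it is written in Python 3 terms (bytes vs. str), which in Python 2 corresponds to guarding with isinstance(text, unicode) before calling unicode(text, "utf8").

```python
def set_encoding(data):
    """Decode bytes to text, trying UTF-8, then Latin-1, then ASCII.

    If `data` is already text, return it unchanged -- decoding an
    already-decoded string is exactly what raises the TypeError in the
    traceback above.
    """
    if isinstance(data, bytes):
        try:
            return data.decode("utf8")
        except UnicodeError:
            try:
                return data.decode("iso8859_1")
            except UnicodeError:
                return data.decode("ascii", "replace")
    return data  # already text; pass through
```

Both `set_encoding(b"caf\xc3\xa9")` and `set_encoding(u"caf\u00e9")` then yield the same text value.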

Why doesn't it recognize explicit dates

Hello, I have the following sentence:
"See you in July 18th, 2016".
When I use your "tag" function, the output is unchanged:
"See you in July 18th, 2016"
I think it should tag 'July 18th'. Is there a way to include it?
Also, weekday cannot be identified, for example:
"See you on Monday" --> Monday is not recognized.
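The two missed cases could be covered with patterns along these lines. This is only a sketch: the real tagger maintains its own pattern set, and tag_dates, MONTHS, and WEEKDAYS here are illustrative names, not part of the library. The <TIMEX2>...</TIMEX2> wrapping mirrors the style of temporal-expression taggers.

```python
import re

# Hypothetical extra patterns for the two cases reported above.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"

# "July 18th, 2016" style: month, day with optional ordinal suffix,
# optional ", YYYY".
explicit_date = re.compile(
    r"\b(%s) \d{1,2}(st|nd|rd|th)?(, \d{4})?\b" % MONTHS)
weekday = re.compile(r"\b(%s)\b" % WEEKDAYS)

def tag_dates(text):
    """Wrap matched expressions in <TIMEX2> tags."""
    text = explicit_date.sub(
        lambda m: "<TIMEX2>%s</TIMEX2>" % m.group(0), text)
    text = weekday.sub(
        lambda m: "<TIMEX2>%s</TIMEX2>" % m.group(0), text)
    return text
```

For example, `tag_dates("See you in July 18th, 2016")` yields `"See you in <TIMEX2>July 18th, 2016</TIMEX2>"`, and `"See you on Monday"` gets its weekday tagged.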

Enhanced version of Bioreader in nltk_contrib

(migrated from nltk/nltk#149)


Hi
An enhanced version of bioreader in the nltk_contrib [http://code.google.com/p/nltk/source/browse/trunk#trunk%2Fnltk_contrib%2Fnltk_contrib%2Fbioreader] directory is available at https://bitbucket.org/jagan/bioreader.
Code cleanup and implementation of coding standards have been done.

Jaganadh G

Migrated from http://code.google.com/p/nltk/issues/detail?id=661

earlier comments
StevenBird1 said, at 2011-04-08T13:26:40.000Z:

Thanks. Would you please describe what the extra files are for? Also, please remember to use "new style" Python classes.

jaganadhg said, at 2011-04-08T14:43:05.000Z:

Dear Steven, the extra files are programs which I used for testing. I have now removed them from Bitbucket. I will implement the "new style" Python classes soon; if possible I will finish by this weekend.

jaganadhg said, at 2011-04-08T15:33:53.000Z:

Dear Steven, I have just incorporated the "new style" Python classes and also made some minor corrections to the API documentation. Jaganadh G

Release nltk_contrib on PyPI

I find it unnecessarily difficult to install the nltk_contrib package, as it is not published on PyPI. I know it can still be installed with pip, but I want to list nltk_contrib in the dependency list of my setup.py file.

Please consider pushing a release to PyPI.
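Until such a release exists, one workaround (a sketch, assuming the canonical GitHub repository) is to point pip at the repository directly in a requirements file:

```
# requirements.txt -- hypothetical pin until a PyPI release exists
git+https://github.com/nltk/nltk_contrib.git#egg=nltk_contrib
```

This still doesn't help with declaring the dependency in setup.py for distribution, which is why a proper PyPI release would be preferable.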

MUC 7 Corpus reader Crashes

>>> r = LazyCorpusLoader('muc_7/', MUCCorpusReader, 'data/..ne.eng.keys.')
>>> r.iob_sents()
[[('Like', 'O'), ('most', 'O'), ('of', 'O'), ('the', 'O'), ('two', 'O'), ('million', 'O'), ('infants', 'O'), ('under', 'O'), ('2', 'O'), ('who', 'O'), ('fly', 'O'), ('with', 'O'), ('their', 'O'), ('parents', 'O'), ('every', 'O'), ('year', 'O'), (',', 'O'), ('Danasia', 'B-PERSON'), ('was', 'O'), ('traveling', 'O'), ('for', 'O'), ('free', 'O'), (',', 'O'), ('seated', 'O'), ('on', 'O'), ('her', 'O'), ('mother', 'O'), ("'s", 'O'), ('lap', 'O'), ('.', 'O')], [('As', 'O'), ('the', 'O'), ('DC-9', 'O'), ('approached', 'O'), ('the', 'O'), ('airport', 'O'), ('on', 'O'), ('July', 'B-DATE'), ('2', 'I-DATE'), (',', 'I-DATE'), ('1994', 'I-DATE'), (',', 'O'), ('wind', 'O'), ('shear', 'O'), ('slammed', 'O'), ('the', 'O'), ('plane', 'O'), ('to', 'O'), ('the', 'O'), ('ground', 'O'), ('.', 'O')], ...]
>>> len(r.iob_sents())
[Tree('S', ['The', Tree('ORGANIZATION', ['Unicef']), 'Flyer', 'flight', 'suffered', 'a', 'setback', Tree('DATE', ['Dec', '.'])]), Tree('DATE', ['Dec', '.'])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in __len__
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in <genexpr>
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 807, in __len__
    if len(self._offsets) <= len(self._list):
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in __len__
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in <genexpr>
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 807, in __len__
    if len(self._offsets) <= len(self._list):
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in __len__
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/util.py", line 966, in <genexpr>
    return max(len(lst) for lst in self._lists)
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/corpus/reader/util.py", line 379, in __len__
    for tok in self.iterate_from(self._offsets[-1]): pass
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/corpus/reader/util.py", line 401, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "/usr/local/lib/python2.7/dist-packages/nltk-2.0.1rc4-py2.7.egg/nltk/corpus/reader/util.py", line 298, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 419, in _read_parsed_block
    return map(self._parse, self._read_block(stream))
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 428, in _parse
    tree = mucstr2tree(doc, top_node='DOC')
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 468, in mucstr2tree
    'text': _muc_read_text(match.group('text'), top_node),
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 534, in _muc_read_text
    tree[-1].append(_muc_read_words(sent, 'S'))
  File "/usr/local/lib/python2.7/dist-packages/nltk_contrib/coref/muc.py", line 558, in _muc_read_words
    assert len(stack) == 1
AssertionError
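The failing invariant can be illustrated with a sketch. A tree builder like _muc_read_words typically keeps an explicit stack: each opening tag pushes a node, each closing tag pops one, so a well-formed sentence leaves exactly one element (the root) on the stack. The function below is hypothetical (not the actual muc.py code) and only demonstrates why unbalanced markup in a source file would trip `assert len(stack) == 1`.

```python
def check_balance(tokens):
    """Return True if open/close tags in `tokens` are balanced.

    Mimics the stack discipline of a tag-driven tree builder: an
    unmatched opening tag leaves an extra frame on the stack, which is
    the condition the assertion in _muc_read_words guards against.
    """
    stack = [[]]  # root node
    for tok in tokens:
        if tok.startswith("</"):
            stack.pop()            # closing tag: pop current node
        elif tok.startswith("<"):
            node = []              # opening tag: push a child node
            stack[-1].append(node)
            stack.append(node)
        else:
            stack[-1].append(tok)  # plain word
    return len(stack) == 1
```

So `["<ENAMEX>", "Unicef", "</ENAMEX>"]` balances, while `["<ENAMEX>", "Unicef"]` does not; an unclosed tag somewhere in the MUC-7 key file would produce exactly this kind of AssertionError.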

Rewritten readability tests from nltk_contrib

Hi,

I wanted to use the readability tests from the nltk_contrib package. The code did not meet my requirements (e.g. choosing between Dutch and English), so I rewrote it.

I put it there so others can use the work as well. Maybe it could become part of the official NLTK package? I would like to edit it so that it meets NLTK's coding standards.

Migrated from http://code.google.com/p/nltk/issues/detail?id=677

earlier comments
alex.rudnick said, at 2011-05-24T06:34:59.000Z:

Thanks for submitting updates! Could you outline the changes you made? Improvements are always welcome! Are you interested in contributing this as part of nltk_contrib? (could it maybe be merged in to the existing readability package, or would you prefer it to be separate?)

izidor.matusov said, at 2011-05-24T06:58:19.000Z:

I made these changes:
1. The original readability tests module of nltk_contrib contains code which is not related to readability tests. That code was removed.
2. Repaired a few lines so the code actually runs in the current version of Python.
3. Removed support for Dutch.
4. Polished the interface.
5. Rewrote the code so pylint does not nag so much.
Yes, I am interested. It could be merged, but there are many radical changes, so it could be a problem for anyone who uses the previous readability package.


Migrated from nltk/nltk#161

readabilitytests problem with utf-8 characters

I ran into a problem trying to apply the readability tests to a block of text with some UTF-8 characters (fancy quotes).

Sample text: http://pastebin.com/eRKGMGYn

Test script: http://pastebin.com/aE2DaRvk

I'm not very familiar with nltk_contrib, so perhaps I'm just using it wrong, but it seems to fail regardless of whether I pass a bytestring or a unicode string to ReadabilityTool. I forked nltk_contrib and changed textanalyzer.py so that it accepts unicode instead of bytes, which seems to have fixed the problem for me.

My fork: https://github.com/priceonomics/nltk_contrib

Can someone confirm the issue I'm seeing and whether my fix is appropriate? Feel free to merge it back if it's useful.
