
iepy's Introduction

IEPY

IEPY is an open source tool for Information Extraction focused on Relation Extraction.

To give an example of Relation Extraction, if we are trying to find a birth date in:

"John von Neumann (December 28, 1903 – February 8, 1957) was a Hungarian and American pure and applied mathematician, physicist, inventor and polymath."

then IEPY's task is to identify "John von Neumann" and "December 28, 1903" as the subject and object entities of the "was born in" relation.
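
Conceptually, the extractor's output for this sentence can be pictured as a (subject, relation, object) triple. The sketch below uses a hypothetical namedtuple representation just to illustrate the idea; it is not IEPY's actual data model:

```python
from collections import namedtuple

# Hypothetical representation of one extracted relation instance;
# the names below are illustrative, not IEPY's actual API.
RelationInstance = namedtuple("RelationInstance", "subject relation object")

extracted = RelationInstance(
    subject="John von Neumann",
    relation="was born in",
    object="December 28, 1903",
)

print(extracted.subject, "--[", extracted.relation, "]->", extracted.object)
```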

It's aimed at:
  • users needing to perform Information Extraction on a large dataset.
  • scientists wanting to experiment with new IE algorithms.

Features

Installation

Install the required packages:

sudo apt-get install build-essential python3-dev liblapack-dev libatlas-dev gfortran openjdk-7-jre

Then simply install with pip:

pip install iepy

Full details about the installation are available on the Read the Docs page.

Running the tests

If you are contributing to the project and want to run the tests, all you have to do is:

Learn more

The full documentation is available on Read the Docs.

Authors

IEPY is © 2014 Machinalis in collaboration with the NLP Group at UNC-FaMAF. Its primary authors are:

You can follow the development of this project and report issues at http://github.com/machinalis/iepy

You can join the mailing list here

iepy's People

Contributors

copy-bin, dmoisset, ezesalta, francolq, ganiserb, j0hn, jmansilla, lowks, marcossponton, mspandit, rafacarrascosa


iepy's Issues

Import/Export of annotated corpora

It would be great to have a tool to import and export annotated corpora.
That would allow us to easily share the corpora that we already have (perdate and orgloc) and to receive contributions from future IEPY users.

IEPY raises IndexError sometimes when classifier is using weighted classes

Classifier configuration:

    "classifier_config": {
        "classifier": "svm",
        "classifier_args": {
            "class_weight": {
                "false": 1,
                "true": 1
            },
            "gamma": 0.0,
            "kernel": "rbf"
        },
        "dimensionality_reduction": null,
        "dimensionality_reduction_dimension": null,
        "feature_selection": "frequency_filter",
        "feature_selection_dimension": 5,
        "features": [
            "bag_of_words_in_between",
            "bag_of_pos_in_between",
            "bag_of_wordpos_in_between",
            "entity_order",
            "entity_distance",
            "other_entities_in_between",
            "verbs_count_in_between"
        ],
        "scaler": true,
        "sparse": true
    },

Full traceback:

Processing |                                | 1/288
2014-06-09 13:27:39,152 - root - ERROR - Experiment failed because of IndexError index out of bounds, skipping...
Traceback (most recent call last):
  File "experimentation/loop/experiment_runner.py", line 338, in <module>
    use_git_info_from_path=path)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/featureforge/experimentation/runner.py", line 67, in main
    result = single_runner(config)
  File "experimentation/loop/experiment_runner.py", line 128, in run_iepy
    answers_given, progression = iepyloop.run_experiment()
  File "experimentation/loop/experiment_runner.py", line 76, in run_experiment
    self.force_process()  # blocking
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 181, in force_process
    self.do_iteration(None)
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 135, in do_iteration
    data = step(data)
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 266, in learn_fact_extractors
    classifiers[rel] = self._build_extractor(rel, Knowledge(k))
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 274, in _build_extractor
    return FactExtractorFactory(self.extractor_config, data)
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 185, in FactExtractorFactory
    p.fit(data)
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 158, in fit
    self.predictor.fit(X, y)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 131, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
    y = self._validate_targets(y)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 442, in _validate_targets
    self.class_weight_ = compute_class_weight(self.class_weight, cls, y)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/utils/class_weight.py", line 52, in compute_class_weight
    if classes[i] != c:
IndexError: index out of bounds

broken script cross_validate

Traceback (most recent call last):
  File "scripts/cross_validate.py", line 114, in <module>
    accuracy, precision, recall = main(opts)
  File "scripts/cross_validate.py", line 67, in main
    standard = Knowledge.load_from_csv(options['<gold_standard>'], connection)
TypeError: load_from_csv() takes exactly 2 arguments (3 given)
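
A minimal, self-contained reproduction of the arity mismatch (the Knowledge class below is a stand-in, not IEPY's real one): the classmethod accepts only a path, but the script still passes a second connection argument.

```python
class Knowledge(dict):
    """Stand-in for iepy's Knowledge class, just to show the arity mismatch."""

    @classmethod
    def load_from_csv(cls, path):  # 2 arguments, counting the implicit cls
        return cls()


try:
    # What scripts/cross_validate.py does: passes an extra `connection` argument.
    Knowledge.load_from_csv("gold_standard.csv", "connection")
except TypeError as e:
    print(e)  # exact wording varies between Python 2 and 3

# The fix is updating the caller to match the current signature:
standard = Knowledge.load_from_csv("gold_standard.csv")
```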

Error downloading third party data

javier@my_computer:~/repo$ python scripts/download_third_party_data.py
Downloading third party software...
Downloading punkt tokenizer
[nltk_data] Downloading package 'punkt' to /home/javier/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Downloading wordnet
[nltk_data] Downloading package 'wordnet' to /home/javier/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
Traceback (most recent call last):
  File "scripts/download_third_party_data.py", line 17, in <module>
    download_third_party_data()
  File "scripts/download_third_party_data.py", line 11, in download_third_party_data
    download_tagger()
  File "/home/javier/repo/iepy/tagger.py", line 67, in download
    os.mkdir(DIRS.user_data_dir)
OSError: [Errno 2] No such file or directory: '/home/javier/.config/iepy'
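
The failure is os.mkdir being asked to create ~/.config/iepy while ~/.config itself does not exist; os.mkdir does not create intermediate directories, but os.makedirs does. A sketch of the fix, using a temporary directory as a stand-in for DIRS.user_data_dir:

```python
import os
import tempfile

# Stand-in for DIRS.user_data_dir, e.g. /home/javier/.config/iepy
base = tempfile.mkdtemp()
user_data_dir = os.path.join(base, ".config", "iepy")

try:
    os.mkdir(user_data_dir)  # fails: the '.config' parent does not exist yet
except OSError as e:
    print(e)

os.makedirs(user_data_dir)  # creates '.config' and 'iepy' in one call
print(os.path.isdir(user_data_dir))  # True
```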

frequency_filter feature selection doesn't work with sparse matrices

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/francolq/Documents/comp/machinalis/experimental/iepy/scripts/iepy_runner.py", line 51, in <module>
    p.force_process()
  File "iepy/core.py", line 182, in force_process
    self.do_iteration(None)
  File "iepy/core.py", line 136, in do_iteration
    data = step(data)
  File "iepy/core.py", line 267, in learn_fact_extractors
    classifiers[rel] = self._build_extractor(rel, Knowledge(k))
  File "iepy/core.py", line 277, in _build_extractor
    return FactExtractorFactory(self.extractor_config, data)
  File "iepy/fact_extractor.py", line 176, in FactExtractorFactory
    p.fit(data)
  File "iepy/fact_extractor.py", line 157, in fit
    self.predictor.fit(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.py", line 130, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.py", line 122, in _pre_transform
    Xt = transform.fit(Xt, y, **fit_params_steps[name])
  File "iepy/fact_extractor.py", line 429, in fit
    if not any(self.mask):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
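
The crash comes from calling the builtin any() on a NumPy mask: when iteration yields sub-arrays (as it can for the masks built in the sparse configuration), Python asks each one for a single truth value and NumPy refuses. The array's own .any() method is unambiguous. A small reproduction; the shapes here are illustrative, not the exact ones IEPY builds:

```python
import numpy as np

# Iterating a 2-D array yields rows; bool(row) on a multi-element row
# raises exactly the "truth value is ambiguous" error from the traceback.
try:
    any(np.array([[False, True]]))
except ValueError as e:
    print(e)

# Shape-independent spelling the fix can use instead:
mask = np.array([False, True, False])
if not mask.any():
    print("mask keeps no columns at all")
print(bool(mask.any()))  # True
```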

Cannot use pre_process.py under Mac OS

Apparently the problem is in corenlp.sh, which is sitting here:

/Users/dchaplinsky/Library/Application Support/iepy/stanford-corenlp-full-2014-08-27/corenlp.sh

and dirname $0 fails because of the space in the path.

Fixed it by patching the .sh to use dirname "$0".

I understand that this is more of a StanfordNER problem, but maybe my fix will be useful to somebody.
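
In POSIX shell, an unquoted $0 undergoes word splitting, so a path containing "Application Support" becomes two arguments to dirname. Quoting it keeps the path as one word; a minimal demonstration, with the path hardcoded as a stand-in for $0:

```shell
script='/Users/dchaplinsky/Library/Application Support/iepy/stanford-corenlp-full-2014-08-27/corenlp.sh'

# Unquoted expansion would split on the space in "Application Support"
# and pass two arguments to dirname. Quoting keeps it as one word:
dirname "$script"
# /Users/dchaplinsky/Library/Application Support/iepy/stanford-corenlp-full-2014-08-27
```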

Raise warning if installing with python 2

Enhancement proposal by @makmanalp thanks for reporting!

"While installing, it took me a minute to realize that it was python 3 only, which is not a big deal, but it might help to have something in setup.py that errors out and gives you a message if you try to do that - or maybe there's a way in pypi to disallow from installing on python 2?"
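
A sketch of the proposed guard for setup.py: abort early with a clear message when run under Python 2. (Newer packaging also supports python_requires=">=3" in setup(), which makes pip refuse the install outright.)

```python
import sys

# Hand-rolled version check, suitable for the top of setup.py:
if sys.version_info[0] < 3:
    sys.exit("IEPY requires Python 3; you are running Python %d.%d. "
             "Please install it under a Python 3 interpreter."
             % sys.version_info[:2])

print("Python version check passed")
```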

preprocessing: override not working

Setting override=True in any step of the pipeline has no effect. I think the problem is in PreProcessPipeline, where get_documents_lacking_preprocess() is used regardless of whether we are trying to override.
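
A sketch of what the fix could look like (the function and variable names here are illustrative, not IEPY's actual ones): when override is set, the step should receive every document instead of only the unprocessed ones.

```python
def documents_for_step(all_docs, docs_lacking_preprocess, override):
    """Pick which documents a pipeline step should run on."""
    return all_docs if override else docs_lacking_preprocess

all_docs = ["doc1", "doc2", "doc3"]
pending = ["doc3"]  # only doc3 lacks this preprocess step

print(documents_for_step(all_docs, pending, override=False))  # ['doc3']
print(documents_for_step(all_docs, pending, override=True))   # all three docs
```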

Add Sphinx docs

Add Sphinx stuff to the docs, so we can integrate the documentation into readthedocs easily.

Bad behavior if only positive or only negative evidence present

Classification fails (is impossible) if only negative (or only positive) evidence is present.
Right now, an uncaught exception is raised if only negative examples are present.

This issue has not been consciously addressed, and correct behavior should be ensured.

"What is correct behavior?" is open for discussion; what I propose is: in all stages of the pipeline except the human interaction (stage 2), skip relations that don't have both positive and negative evidence.
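
The proposed policy can be sketched as a simple filter over the labeled evidence; the data layout below is made up for illustration:

```python
# Labels collected per relation: True = positive evidence, False = negative.
labeled = {
    "was_born": [True, False, True],   # both kinds present: trainable
    "works_at": [False, False],        # negative-only: would crash training
    "lives_in": [True, True],          # positive-only: also untrainable
}

# Keep only relations with at least one positive AND one negative example.
trainable = {
    rel: labels for rel, labels in labeled.items()
    if any(labels) and not all(labels)
}

print(sorted(trainable))  # ['was_born']
```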

Dead code to remove

The method _confidence of the class BootstrappedIEPipeline is not being used.

Should we remove it?

Add lemmas/stemming calculated by CoreNLP to database

Some features of the classifier in fact_extraction require stemming and/or lemmatization.
This is currently done using nltk.
Since CoreNLP already calculates this information during preprocessing, it would be more efficient to save it into the database and then reuse it (instead of re-calculating it).

Cannot start core without candidate evidence

Hi,

I'm interested in the whole subject matter of iepy, but the vocabulary and theories attached to NLP are very new to me. My apologies if what I'm asking shows a complete lack of understanding.

I've installed iepy, and created a simple .csv file and imported it:

$ python bin/csv_to_iepy.py docs.csv 

Importing Documents to IEPY from docs.csv
Added 1 documents
Added 2 documents

Then preprocessed:

$ python bin/preprocess.py

Starting preprocessing step <iepy.preprocess.stanford_preprocess.StanfordPreprocess object at 0x7ff43878ce90>
Loading StanfordCoreNLP...
    Done for 1 documents
    Done for 2 documents
Starting preprocessing step <iepy.preprocess.segmenter.SyntacticSegmenterRunner object at 0x7ff43878cf90>
About to set 1 segments for current doc
New 1 segments created
    Done for 1 documents
About to set 1 segments for current doc
New 1 segments created
    Done for 2 documents

Using the web interface, I created a relation called rvb and an entity kind Entity (to which the relation is attached, left and right).

If I then run

$  python bin/iepy_runner.py rvb

I get

Loading candidate evidence from database...
Getting labels from DB
Sorting labels them by evidence
Labels conflict solving
Traceback (most recent call last):
  File "bin/iepy_runner.py", line 71, in <module>
    performance_tradeoff=tuning_mode)
  File "/home/mathieu/dev/iepy-test/lib/python3.3/site-packages/iepy/extraction/active_learning_core.py", line 48, in __init__
    self._setup_labeled_evidences(labeled_evidences)
  File "/home/mathieu/dev/iepy-test/lib/python3.3/site-packages/iepy/extraction/active_learning_core.py", line 162, in _setup_labeled_evidences
    raise ValueError("Cannot start core without candidate evidence")
ValueError: Cannot start core without candidate evidence

As I said, this is probably due to my complete lack of understanding of the theories behind IEPY, but how can I create candidate evidence?

LiteralNER shall tokenize each entry

Right now our LiteralNER is very literal, so in some cases it does not work.

Example: an entry like this

takayasu's arteritis

is never found, because the documents will be tokenized, transforming this

John had takayasu's arteritis

into this

John had takayasu 's arteritis

making a match impossible (notice that 's is a separate token).

Also, what makes things harder is that the tokenizer used to parse the LiteralNER entries must be the same one used when tokenizing the text.
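
The fix the issue asks for can be sketched as tokenizing each literal entry and then matching token subsequences. The toy tokenizer below splits off "'s" the way the issue describes; IEPY's real tokenizer is different, and this only illustrates the idea:

```python
def tokenize(text):
    """Toy tokenizer that splits possessive 's into its own token."""
    tokens = []
    for word in text.split():
        if word.endswith("'s"):
            tokens.extend([word[:-2], "'s"])
        else:
            tokens.append(word)
    return tokens

def find_entry(entry_tokens, doc_tokens):
    """True if entry_tokens appears as a contiguous subsequence of doc_tokens."""
    n = len(entry_tokens)
    return any(doc_tokens[i:i + n] == entry_tokens
               for i in range(len(doc_tokens) - n + 1))

entry = tokenize("takayasu's arteritis")        # ['takayasu', "'s", 'arteritis']
doc = tokenize("John had takayasu's arteritis")

print(find_entry(entry, doc))                          # True: tokenized entry matches
print(find_entry(["takayasu's", "arteritis"], doc))    # False: untokenized entry never matches
```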

Document warning about sqlite performance

By default, an iepy --create instantiation uses a SQLite database. Since the performance of this database is awful, it would be good to warn the user about this in the documentation.

Replace deprecated code when using MongoEngine

Running the preprocess, or the tests under python 2 logs:

DeprecationWarning: get_or_create is scheduled to be deprecated. The approach is flawed without transactions. Upserts should be preferred.

Arrow direction in corpora construction UI

Hi, I've noticed that the direction of the arrows when you are building a corpus (labeling by hand) depends on the order in which you click the entity occurrences, but has no relation to the order of the entity kinds in the relation.

I don't know if it's intentional or not, I just noticed it and wanted to leave record.

Big texts are breaking import process

I've tried to import my corpus of Ukrainian texts and apparently one of them was too big for iepy:

Added 2503 documents
Traceback (most recent call last):
  File "bin/csv_to_iepy.py", line 29, in <module>
    csv_to_iepy(filepath)
  File "/Users/dchaplinsky/Projects/pullenti-ukr/iepy/venv/lib/python3.4/site-packages/iepy/utils.py", line 111, in csv_to_iepy
    for i, d in enumerate(reader):
  File "/usr/local/Cellar/python3/3.4.1_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 110, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)

While I understand that having such a text in the corpus is a bit stupid, I think a good solution here would be:

  • Capture the exception
  • Show a warning
  • Continue the import
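
For reference, the limit can also simply be raised. The sketch below reproduces the error with an in-memory CSV and then bumps the limit so oversized fields import; catching csv.Error per row and warning, as proposed above, is the other option:

```python
import csv
import io
import sys

big_field = "x" * 200000  # larger than the default 131072-byte field limit
data = "id,text\n1," + big_field + "\n"

try:
    list(csv.reader(io.StringIO(data)))
except csv.Error as e:
    print(e)  # field larger than field limit (131072)

# Raising the limit lets very large documents through:
csv.field_size_limit(sys.maxsize)
rows = list(csv.reader(io.StringIO(data)))
print(len(rows[1][1]))  # 200000
```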

Download 3rd party downloads things that are not used

The download-3rd-party script downloads additional packages that are not needed, like:

  • Stanford NER
  • Stanford postagger
  • nltk stuff (perhaps not needed, really not sure).

The script should only download what is actually used by IEPY.

cannot inherit from IEDocument

In my own app I am trying to subclass IEDocument to have additional fields and methods, but mongoengine says I can't do it. Not sure if it's a bug or a design decision, but it would be useful in my case to be able to subclass IEDocument.

For instance:

>>> from iepy.models import IEDocument
>>> class MyDocument(IEDocument):
...     pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/mongoengine/base/metaclasses.py", line 332, in __new__
    new_class = super_new(cls, name, bases, attrs)
  File "/usr/local/lib/python2.7/dist-packages/mongoengine/base/metaclasses.py", line 120, in __new__
    base.__name__)
ValueError: Document IEDocument may not be subclassed

Document Corpus Navigation

The documentation does not make it completely clear that once you run the preprocess you can already browse and edit the processed data.

Generate seeds silently do nothing with invalid entity kind provided

Would be great to add the following features to the generate_seeds script:

  • Option --list-entity-kinds that prints the available entity-kinds
  • When a non-existent entity kind is provided, print a message saying so and suggest listing them with the option mentioned above.

(reported by jmansilla)

Add ability to override Entity Ocurrences

In our example tvseries app, the word "House", which refers to the person, is instead labeled as an organization by the default NER[1] we have.

[1] probably because our NER was trained on the Wall Street Journal, so it may refer to the White House... dunno.

rules runner does not take a relation parameter

Every other script on the instances takes a parameter for the relation used, but the rules runner takes it from the rules.py file.

For uniformity purposes we should think about how we can change that, keeping in mind that a set of rules is specific to a relation.

I open this to discussion.

bootstrap fails if every evidence is labeled as False

The bug can be reproduced answering 'n' to a question and then 'run'.

In my case:
$ python scripts/iepy_runner.py house_pages_current seeds.csv facts.csv
...
(y/n/d/run/STOP): n
...
(y/n/d/run/STOP): run
...
Traceback (most recent call last):
  File "scripts/iepy_runner.py", line 40, in <module>
    p.force_process()
  File "/home/francolq/Documents/comp/machinalis/iepy/iepy/core.py", line 318, in force_process
    self.do_iteration(None)
  File "/home/francolq/Documents/comp/machinalis/iepy/iepy/core.py", line 281, in do_iteration
    data = step(data)
  File "/home/francolq/Documents/comp/machinalis/iepy/iepy/core.py", line 401, in extract_facts
    true_index = list(classifier.classes_).index(True)
ValueError: True is not in list
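
A plain-Python reproduction of the failure mode (classes_ below stands in for the scikit-learn classifier attribute): when every answer was 'n', the classifier only ever saw the False class, so looking up the index of True raises.

```python
classes_ = [False]  # all evidence was labeled negative

try:
    true_index = list(classes_).index(True)
except ValueError as e:
    print(e)  # True is not in list

# A defensive spelling that degrades gracefully instead of crashing:
true_index = list(classes_).index(True) if True in classes_ else None
print(true_index)  # None: no positive class was ever learned
```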

IEPY raises ValueError ColumnFilter eliminates all columns!

Classifier Configuration:

    "classifier_config": {
        "classifier": "svm",
        "classifier_args": {
            "class_weight": {
                "false": 1,
                "true": 1
            },
            "gamma": 0.0,
            "kernel": "rbf"
        },
        "dimensionality_reduction": null,
        "dimensionality_reduction_dimension": null,
        "feature_selection": "frequency_filter",
        "feature_selection_dimension": 10,
        "features": [
            "bag_of_words_in_between",
            "bag_of_pos_in_between",
            "bag_of_wordpos_in_between",
            "entity_order",
            "entity_distance",
            "other_entities_in_between",
            "verbs_count_in_between"
        ],
        "scaler": true,
        "sparse": true
    },

Full traceback:

Processing |                                | 2/288
2014-06-09 13:34:29,163 - root - ERROR - Experiment failed because of ValueError ColumnFilter eliminates all columns!, skipping...
Traceback (most recent call last):
  File "experimentation/loop/experiment_runner.py", line 338, in <module>
    use_git_info_from_path=path)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/featureforge/experimentation/runner.py", line 67, in main
    result = single_runner(config)
  File "experimentation/loop/experiment_runner.py", line 128, in run_iepy
    answers_given, progression = iepyloop.run_experiment()
  File "experimentation/loop/experiment_runner.py", line 76, in run_experiment
    self.force_process()  # blocking
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 181, in force_process
    self.do_iteration(None)
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 135, in do_iteration
    data = step(data)
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 266, in learn_fact_extractors
    classifiers[rel] = self._build_extractor(rel, Knowledge(k))
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 274, in _build_extractor
    return FactExtractorFactory(self.extractor_config, data)
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 185, in FactExtractorFactory
    p.fit(data)
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 158, in fit
    self.predictor.fit(X, y)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 122, in _pre_transform
    Xt = transform.fit(Xt, y, **fit_params_steps[name])
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 442, in fit
    raise ValueError("ColumnFilter eliminates all columns!")
ValueError: ColumnFilter eliminates all columns!

Cache evidence vectorization to improve speed

Once the BootstrappedIEPipeline starts running the Evidence instances generated are (kind of) read-only.

The same evidences will be generated over and over in stage 1. Therefore, the vectorization needed for the classifier at stages 3 and 5 could be cached for future pipeline cycles, improving the execution speed.
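
One way to sketch such a cache (all names here are hypothetical): key each evidence by a stable identity and memoize its feature vector across pipeline cycles.

```python
_vector_cache = {}

def cached_vectorize(evidence_key, compute_vector):
    """Return the vector for evidence_key, computing it only on a cache miss."""
    if evidence_key not in _vector_cache:
        _vector_cache[evidence_key] = compute_vector()
    return _vector_cache[evidence_key]

calls = 0

def expensive_features():
    global calls
    calls += 1
    return [1.0, 0.0, 3.0]

key = ("doc-1", "segment-2", "was_born")        # stable identity of one Evidence
v1 = cached_vectorize(key, expensive_features)
v2 = cached_vectorize(key, expensive_features)  # later cycle: cache hit

print(calls)     # 1 -- features were computed only once
print(v1 is v2)  # True -- the same vector object is reused
```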
