
iepy's Introduction

IEPY

IEPY is an open source tool for Information Extraction focused on Relation Extraction.

To give an example of Relation Extraction, if we are trying to find a birth date in:

"John von Neumann (December 28, 1903 – February 8, 1957) was a Hungarian and American pure and applied mathematician, physicist, inventor and polymath."

then IEPY's task is to identify "John von Neumann" and "December 28, 1903" as the subject and object entities of the "was born in" relation.
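
Conceptually, the extractor's output for this sentence can be pictured as a (subject, relation, object) triple. The sketch below uses a hypothetical namedtuple representation just to illustrate the idea; it is not IEPY's actual data model:

```python
from collections import namedtuple

# Hypothetical representation of one extracted relation instance;
# the names below are illustrative, not IEPY's actual API.
RelationInstance = namedtuple("RelationInstance", "subject relation object")

extracted = RelationInstance(
    subject="John von Neumann",
    relation="was born in",
    object="December 28, 1903",
)

print(extracted.subject, "--[", extracted.relation, "]->", extracted.object)
```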

It's aimed at:
  • users needing to perform Information Extraction on a large dataset.
  • scientists wanting to experiment with new IE algorithms.

Features

Installation

Install the required packages:

sudo apt-get install build-essential python3-dev liblapack-dev libatlas-dev gfortran openjdk-7-jre

Then simply install with pip:

pip install iepy

Full details about the installation are available on the Read the Docs page.

Running the tests

If you are contributing to the project and want to run the tests, all you have to do is:

Learn more

The full documentation is available on Read the Docs.

Authors

IEPY is © 2014 Machinalis in collaboration with the NLP Group at UNC-FaMAF. Its primary authors are:

You can follow the development of this project and report issues at http://github.com/machinalis/iepy

You can join the mailing list here

iepy's People

Contributors

copy-bin, dmoisset, ezesalta, francolq, ganiserb, j0hn, jmansilla, lowks, marcossponton, mspandit, rafacarrascosa


iepy's Issues

Import/Export of annotated corpora

It would be great to have a tool to import and export annotated corpora.
That would allow us to easily share the corpora that we already have (perdate and orgloc) and to receive contributions from future IEPY users.

IEPY raises IndexError sometimes when classifier is using weighted classes

Classifier configuration:

    "classifier_config": {
        "classifier": "svm",
        "classifier_args": {
            "class_weight": {
                "false": 1,
                "true": 1
            },
            "gamma": 0.0,
            "kernel": "rbf"
        },
        "dimensionality_reduction": null,
        "dimensionality_reduction_dimension": null,
        "feature_selection": "frequency_filter",
        "feature_selection_dimension": 5,
        "features": [
            "bag_of_words_in_between",
            "bag_of_pos_in_between",
            "bag_of_wordpos_in_between",
            "entity_order",
            "entity_distance",
            "other_entities_in_between",
            "verbs_count_in_between"
        ],
        "scaler": true,
        "sparse": true
    },

Full traceback:

Processing |                                | 1/288
2014-06-09 13:27:39,152 - root - ERROR - Experiment failed because of IndexError index out of bounds, skipping...
Traceback (most recent call last):
  File "experimentation/loop/experiment_runner.py", line 338, in <module>
    use_git_info_from_path=path)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/featureforge/experimentation/runner.py", line 67, in main
    result = single_runner(config)
  File "experimentation/loop/experiment_runner.py", line 128, in run_iepy
    answers_given, progression = iepyloop.run_experiment()
  File "experimentation/loop/experiment_runner.py", line 76, in run_experiment
    self.force_process()  # blocking
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 181, in force_process
    self.do_iteration(None)
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 135, in do_iteration
    data = step(data)
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 266, in learn_fact_extractors
    classifiers[rel] = self._build_extractor(rel, Knowledge(k))
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 274, in _build_extractor
    return FactExtractorFactory(self.extractor_config, data)
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 185, in FactExtractorFactory
    p.fit(data)
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 158, in fit
    self.predictor.fit(X, y)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 131, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
    y = self._validate_targets(y)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 442, in _validate_targets
    self.class_weight_ = compute_class_weight(self.class_weight, cls, y)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/utils/class_weight.py", line 52, in compute_class_weight
    if classes[i] != c:
IndexError: index out of bounds

broken script cross_validate

Traceback (most recent call last):
  File "scripts/cross_validate.py", line 114, in <module>
    accuracy, precision, recall = main(opts)
  File "scripts/cross_validate.py", line 67, in main
    standard = Knowledge.load_from_csv(options['<gold_standard>'], connection)
TypeError: load_from_csv() takes exactly 2 arguments (3 given)
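
A minimal, self-contained reproduction of the arity mismatch (the Knowledge class below is a stand-in, not IEPY's real one): the classmethod accepts only a path, but the script still passes a second connection argument.

```python
class Knowledge(dict):
    """Stand-in for iepy's Knowledge class, just to show the arity mismatch."""

    @classmethod
    def load_from_csv(cls, path):  # 2 arguments, counting the implicit cls
        return cls()


try:
    # What scripts/cross_validate.py does: passes an extra `connection` argument.
    Knowledge.load_from_csv("gold_standard.csv", "connection")
except TypeError as e:
    print(e)  # exact wording varies between Python 2 and 3

# The fix is updating the caller to match the current signature:
standard = Knowledge.load_from_csv("gold_standard.csv")
```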

Error downloading third party data

javier@my_computer:~/repo$ python scripts/download_third_party_data.py
Downloading third party software...
Downloading punkt tokenizer
[nltk_data] Downloading package 'punkt' to /home/javier/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Downloading wordnet
[nltk_data] Downloading package 'wordnet' to /home/javier/nltk_data...
[nltk_data] Unzipping corpora/wordnet.zip.
Traceback (most recent call last):
  File "scripts/download_third_party_data.py", line 17, in <module>
    download_third_party_data()
  File "scripts/download_third_party_data.py", line 11, in download_third_party_data
    download_tagger()
  File "/home/javier/repo/iepy/tagger.py", line 67, in download
    os.mkdir(DIRS.user_data_dir)
OSError: [Errno 2] No such file or directory: '/home/javier/.config/iepy'
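
The failure is os.mkdir being asked to create ~/.config/iepy while ~/.config itself does not exist; os.mkdir does not create intermediate directories, but os.makedirs does. A sketch of the fix, using a temporary directory as a stand-in for DIRS.user_data_dir:

```python
import os
import tempfile

# Stand-in for DIRS.user_data_dir, e.g. /home/javier/.config/iepy
base = tempfile.mkdtemp()
user_data_dir = os.path.join(base, ".config", "iepy")

try:
    os.mkdir(user_data_dir)  # fails: the '.config' parent does not exist yet
except OSError as e:
    print(e)

os.makedirs(user_data_dir)  # creates '.config' and 'iepy' in one call
print(os.path.isdir(user_data_dir))  # True
```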

frequency_filter feature selection doesn't work with sparse matrices

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/francolq/Documents/comp/machinalis/experimental/iepy/scripts/iepy_runner.py", line 51, in <module>
    p.force_process()
  File "iepy/core.py", line 182, in force_process
    self.do_iteration(None)
  File "iepy/core.py", line 136, in do_iteration
    data = step(data)
  File "iepy/core.py", line 267, in learn_fact_extractors
    classifiers[rel] = self._build_extractor(rel, Knowledge(k))
  File "iepy/core.py", line 277, in _build_extractor
    return FactExtractorFactory(self.extractor_config, data)
  File "iepy/fact_extractor.py", line 176, in FactExtractorFactory
    p.fit(data)
  File "iepy/fact_extractor.py", line 157, in fit
    self.predictor.fit(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.py", line 130, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.py", line 122, in _pre_transform
    Xt = transform.fit(Xt, y, **fit_params_steps[name])
  File "iepy/fact_extractor.py", line 429, in fit
    if not any(self.mask):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
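
The crash comes from calling the builtin any() on a NumPy mask: when iteration yields sub-arrays (as it can for the masks built in the sparse configuration), Python asks each one for a single truth value and NumPy refuses. The array's own .any() method is unambiguous. A small reproduction; the shapes here are illustrative, not the exact ones IEPY builds:

```python
import numpy as np

# Iterating a 2-D array yields rows; bool(row) on a multi-element row
# raises exactly the "truth value is ambiguous" error from the traceback.
try:
    any(np.array([[False, True]]))
except ValueError as e:
    print(e)

# Shape-independent spelling the fix can use instead:
mask = np.array([False, True, False])
if not mask.any():
    print("mask keeps no columns at all")
print(bool(mask.any()))  # True
```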

Cannot use pre_process.py under Mac OS

Apparently the problem is in corenlp.sh, which is sitting here:

/Users/dchaplinsky/Library/Application Support/iepy/stanford-corenlp-full-2014-08-27/corenlp.sh

and dirname $0 fails because of the space in the path.

Fixed it by patching the .sh to use dirname "$0".

I understand that this is more of a StanfordNER problem, but maybe my fix will be useful to somebody.
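
In POSIX shell, an unquoted $0 undergoes word splitting, so a path containing "Application Support" becomes two arguments to dirname. Quoting it keeps the path as one word; a minimal demonstration, with the path hardcoded as a stand-in for $0:

```shell
script='/Users/dchaplinsky/Library/Application Support/iepy/stanford-corenlp-full-2014-08-27/corenlp.sh'

# Unquoted expansion would split on the space in "Application Support"
# and pass two arguments to dirname. Quoting keeps it as one word:
dirname "$script"
# /Users/dchaplinsky/Library/Application Support/iepy/stanford-corenlp-full-2014-08-27
```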

Raise warning if installing with python 2

Enhancement proposal by @makmanalp thanks for reporting!

"While installing, it took me a minute to realize that it was python 3 only, which is not a big deal, but it might help to have something in setup.py that errors out and gives you a message if you try to do that - or maybe there's a way in pypi to disallow from installing on python 2?"
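
A sketch of the proposed guard for setup.py: abort early with a clear message when run under Python 2. (Newer packaging also supports python_requires=">=3" in setup(), which makes pip refuse the install outright.)

```python
import sys

# Hand-rolled version check, suitable for the top of setup.py:
if sys.version_info[0] < 3:
    sys.exit("IEPY requires Python 3; you are running Python %d.%d. "
             "Please install it under a Python 3 interpreter."
             % sys.version_info[:2])

print("Python version check passed")
```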

preprocessing: override not working

Setting override=True in any step of the pipeline has no effect. I think the problem is in PreProcessPipeline, where get_documents_lacking_preprocess() is used regardless of whether we are trying to override.
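
A sketch of what the fix could look like (the function and variable names here are illustrative, not IEPY's actual ones): when override is set, the step should receive every document instead of only the unprocessed ones.

```python
def documents_for_step(all_docs, docs_lacking_preprocess, override):
    """Pick which documents a pipeline step should run on."""
    return all_docs if override else docs_lacking_preprocess

all_docs = ["doc1", "doc2", "doc3"]
pending = ["doc3"]  # only doc3 lacks this preprocess step

print(documents_for_step(all_docs, pending, override=False))  # ['doc3']
print(documents_for_step(all_docs, pending, override=True))   # all three docs
```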

Add Sphinx docs

Add Sphinx stuff to the docs, so we can integrate the documentation into readthedocs easily.

Bad behavior if only positive or only negative evidence present

Classification fails (is impossible) if only negative (or only positive) evidence is present.
Right now, an uncaught exception is raised if only negative examples are present.

This issue has not been consciously addressed, and correct behavior should be ensured.

"What is correct behavior?" is open for discussion; what I propose is: in all stages of the pipeline except the human interaction (stage 2), skip relations that don't have both positive and negative evidence.
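
The proposed policy can be sketched as a simple filter over the labeled evidence; the data layout below is made up for illustration:

```python
# Labels collected per relation: True = positive evidence, False = negative.
labeled = {
    "was_born": [True, False, True],   # both kinds present: trainable
    "works_at": [False, False],        # negative-only: would crash training
    "lives_in": [True, True],          # positive-only: also untrainable
}

# Keep only relations with at least one positive AND one negative example.
trainable = {
    rel: labels for rel, labels in labeled.items()
    if any(labels) and not all(labels)
}

print(sorted(trainable))  # ['was_born']
```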

Dead code to remove

The method _confidence of the class BootstrappedIEPipeline is not being used.

Should we remove it?

Add lemmas/stemming calculated by CoreNLP to database

Some features of the classifier in fact_extraction require stemming and/or lemmatization.
This is currently done using nltk.
Since CoreNLP already calculates this information during preprocessing, it would be more efficient to save it into the database and then reuse it (instead of re-calculating it).

Cannot start core without candidate evidence

Hi,

I'm interested in the whole subject matter of iepy, but the vocabulary and theories attached to NLP are very new to me. My apologies if what I'm asking shows a complete lack of understanding.

I've installed iepy, and created a simple .csv file and imported it:

$ python bin/csv_to_iepy.py docs.csv 

Importing Documents to IEPY from docs.csv
Added 1 documents
Added 2 documents

Then preprocessed:

$ python bin/preprocess.py

Starting preprocessing step <iepy.preprocess.stanford_preprocess.StanfordPreprocess object at 0x7ff43878ce90>
Loading StanfordCoreNLP...
    Done for 1 documents
    Done for 2 documents
Starting preprocessing step <iepy.preprocess.segmenter.SyntacticSegmenterRunner object at 0x7ff43878cf90>
About to set 1 segments for current doc
New 1 segments created
    Done for 1 documents
About to set 1 segments for current doc
New 1 segments created
    Done for 2 documents

Using the web interface, I created a relation called rvb and an entity kind Entity (to which the relation is attached, left and right).

If I then run

$  python bin/iepy_runner.py rvb

I get

Loading candidate evidence from database...
Getting labels from DB
Sorting labels them by evidence
Labels conflict solving
Traceback (most recent call last):
  File "bin/iepy_runner.py", line 71, in <module>
    performance_tradeoff=tuning_mode)
  File "/home/mathieu/dev/iepy-test/lib/python3.3/site-packages/iepy/extraction/active_learning_core.py", line 48, in __init__
    self._setup_labeled_evidences(labeled_evidences)
  File "/home/mathieu/dev/iepy-test/lib/python3.3/site-packages/iepy/extraction/active_learning_core.py", line 162, in _setup_labeled_evidences
    raise ValueError("Cannot start core without candidate evidence")
ValueError: Cannot start core without candidate evidence

As I said, this is probably due to my complete lack of understanding of the theories behind IEPY, but how can I create candidate evidence?

LiteralNER shall tokenize each entry

Right now our LiteralNER is very literal, so in some cases it does not work.

Example: an entry like this

takayasu's arteritis

is never found, because the documents will be tokenized, transforming this

John had takayasu's arteritis

into this

John had takayasu 's arteritis

making a match impossible (notice that 's is a separate token).

Also, what makes things harder is that the tokenizer used to parse the LiteralNER entries must be the same one used when tokenizing the text.
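
The fix the issue asks for can be sketched as tokenizing each literal entry and then matching token subsequences. The toy tokenizer below splits off "'s" the way the issue describes; IEPY's real tokenizer is different, and this only illustrates the idea:

```python
def tokenize(text):
    """Toy tokenizer that splits possessive 's into its own token."""
    tokens = []
    for word in text.split():
        if word.endswith("'s"):
            tokens.extend([word[:-2], "'s"])
        else:
            tokens.append(word)
    return tokens

def find_entry(entry_tokens, doc_tokens):
    """True if entry_tokens appears as a contiguous subsequence of doc_tokens."""
    n = len(entry_tokens)
    return any(doc_tokens[i:i + n] == entry_tokens
               for i in range(len(doc_tokens) - n + 1))

entry = tokenize("takayasu's arteritis")        # ['takayasu', "'s", 'arteritis']
doc = tokenize("John had takayasu's arteritis")

print(find_entry(entry, doc))                          # True: tokenized entry matches
print(find_entry(["takayasu's", "arteritis"], doc))    # False: untokenized entry never matches
```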

Document warning about sqlite performance

By default, an iepy --create instantiation uses a SQLite database. Since the performance of this database is awful, it would be good to warn the user about this in the documentation.

Replace deprecated code when using MongoEngine

Running the preprocess, or the tests under python 2 logs:

DeprecationWarning: get_or_create is scheduled to be deprecated. The approach is flawed without transactions. Upserts should be preferred.

Arrow direction in corpora construction UI

Hi, I've noticed that the direction of the arrows when you are building a corpus (labeling by hand) depends on the order in which you click the entity occurrences, but has no relation to the order of the entity kinds in the relation.

I don't know if it's intentional or not, I just noticed it and wanted to leave record.

Big texts are breaking import process

I've tried to import my corpus of Ukrainian texts and apparently one of them was too big for iepy:

Added 2503 documents
Traceback (most recent call last):
  File "bin/csv_to_iepy.py", line 29, in <module>
    csv_to_iepy(filepath)
  File "/Users/dchaplinsky/Projects/pullenti-ukr/iepy/venv/lib/python3.4/site-packages/iepy/utils.py", line 111, in csv_to_iepy
    for i, d in enumerate(reader):
  File "/usr/local/Cellar/python3/3.4.1_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 110, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)

While I understand that having such a text in the corpus is a bit stupid, I think a good solution here would be:

  • Capture the exception
  • Show a warning
  • Continue the import
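
For reference, the limit can also simply be raised. The sketch below reproduces the error with an in-memory CSV and then bumps the limit so oversized fields import; catching csv.Error per row and warning, as proposed above, is the other option:

```python
import csv
import io
import sys

big_field = "x" * 200000  # larger than the default 131072-byte field limit
data = "id,text\n1," + big_field + "\n"

try:
    list(csv.reader(io.StringIO(data)))
except csv.Error as e:
    print(e)  # field larger than field limit (131072)

# Raising the limit lets very large documents through:
csv.field_size_limit(sys.maxsize)
rows = list(csv.reader(io.StringIO(data)))
print(len(rows[1][1]))  # 200000
```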

Download 3rd party downloads things that are not used

The download-3rd-party script downloads additional packages that are not needed, like:

  • Stanford NER
  • Stanford postagger
  • nltk stuff (perhaps not needed, really not sure).

The script should only download what is actually used by IEPY.

cannot inherit from IEDocument

In my own app I am trying to subclass IEDocument to have additional fields and methods, but mongoengine says I can't do it. Not sure if it's a bug or a design decision, but it would be useful in my case to be able to subclass IEDocument.

For instance:

>>> from iepy.models import IEDocument
>>> class MyDocument(IEDocument):
...     pass
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/mongoengine/base/metaclasses.py", line 332, in __new__
    new_class = super_new(cls, name, bases, attrs)
  File "/usr/local/lib/python2.7/dist-packages/mongoengine/base/metaclasses.py", line 120, in __new__
    base.__name__)
ValueError: Document IEDocument may not be subclassed

Document Corpus Navigation

The documentation does not make it completely clear that once you run the preprocess you can already browse and edit the processed data.

Generate seeds silently do nothing with invalid entity kind provided

Would be great to add the following features to the generate_seeds script:

  • Option --list-entity-kinds that prints the available entity-kinds
  • When a non-existent entity kind is provided, print a message saying so and suggest listing them with the option mentioned above.

(reported by jmansilla)

Add ability to override Entity Ocurrences

In our example tvseries app, the word "House", which refers to the person, is instead labeled as an organization by the default NER[1] we have.

[1] probably because our NER was trained on the Wall Street Journal, so it may refer to the White House... dunno.

rules runner does not take a relation parameter

Every other script on the instances takes a parameter for the relation used, but the rules runner takes it from the rules.py file.

For uniformity purposes we should think about how we can change that, keeping in mind that a set of rules is specific to a relation.

I open this to discussion.

bootstrap fails if every evidence is labeled as False

The bug can be reproduced answering 'n' to a question and then 'run'.

In my case:
$ python scripts/iepy_runner.py house_pages_current seeds.csv facts.csv
...
(y/n/d/run/STOP): n
...
(y/n/d/run/STOP): run
...
Traceback (most recent call last):
  File "scripts/iepy_runner.py", line 40, in <module>
    p.force_process()
  File "/home/francolq/Documents/comp/machinalis/iepy/iepy/core.py", line 318, in force_process
    self.do_iteration(None)
  File "/home/francolq/Documents/comp/machinalis/iepy/iepy/core.py", line 281, in do_iteration
    data = step(data)
  File "/home/francolq/Documents/comp/machinalis/iepy/iepy/core.py", line 401, in extract_facts
    true_index = list(classifier.classes_).index(True)
ValueError: True is not in list
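
A plain-Python reproduction of the failure mode (classes_ below stands in for the scikit-learn classifier attribute): when every answer was 'n', the classifier only ever saw the False class, so looking up the index of True raises.

```python
classes_ = [False]  # all evidence was labeled negative

try:
    true_index = list(classes_).index(True)
except ValueError as e:
    print(e)  # True is not in list

# A defensive spelling that degrades gracefully instead of crashing:
true_index = list(classes_).index(True) if True in classes_ else None
print(true_index)  # None: no positive class was ever learned
```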

IEPY raises ValueError ColumnFilter eliminates all columns!

Classifier Configuration:

    "classifier_config": {
        "classifier": "svm",
        "classifier_args": {
            "class_weight": {
                "false": 1,
                "true": 1
            },
            "gamma": 0.0,
            "kernel": "rbf"
        },
        "dimensionality_reduction": null,
        "dimensionality_reduction_dimension": null,
        "feature_selection": "frequency_filter",
        "feature_selection_dimension": 10,
        "features": [
            "bag_of_words_in_between",
            "bag_of_pos_in_between",
            "bag_of_wordpos_in_between",
            "entity_order",
            "entity_distance",
            "other_entities_in_between",
            "verbs_count_in_between"
        ],
        "scaler": true,
        "sparse": true
    },

Full traceback:

Processing |                                | 2/288
2014-06-09 13:34:29,163 - root - ERROR - Experiment failed because of ValueError ColumnFilter eliminates all columns!, skipping...
Traceback (most recent call last):
  File "experimentation/loop/experiment_runner.py", line 338, in <module>
    use_git_info_from_path=path)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/featureforge/experimentation/runner.py", line 67, in main
    result = single_runner(config)
  File "experimentation/loop/experiment_runner.py", line 128, in run_iepy
    answers_given, progression = iepyloop.run_experiment()
  File "experimentation/loop/experiment_runner.py", line 76, in run_experiment
    self.force_process()  # blocking
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 181, in force_process
    self.do_iteration(None)
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 135, in do_iteration
    data = step(data)
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 266, in learn_fact_extractors
    classifiers[rel] = self._build_extractor(rel, Knowledge(k))
  File "/home/jmansilla/projects/iepy/repo/iepy/core.py", line 274, in _build_extractor
    return FactExtractorFactory(self.extractor_config, data)
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 185, in FactExtractorFactory
    p.fit(data)
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 158, in fit
    self.predictor.fit(X, y)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "/home/jmansilla/.virtualenvs/iepy/local/lib/python2.7/site-packages/sklearn/pipeline.py", line 122, in _pre_transform
    Xt = transform.fit(Xt, y, **fit_params_steps[name])
  File "/home/jmansilla/projects/iepy/repo/iepy/fact_extractor.py", line 442, in fit
    raise ValueError("ColumnFilter eliminates all columns!")
ValueError: ColumnFilter eliminates all columns!

Cache evidence vectorization to improve speed

Once the BootstrappedIEPipeline starts running the Evidence instances generated are (kind of) read-only.

The same evidences will be generated over and over in stage 1. Therefore, the vectorization needed for the classifier at stages 3 and 5 could be cached for future pipeline cycles, improving the execution speed.
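
One way to sketch such a cache (all names here are hypothetical): key each evidence by a stable identity and memoize its feature vector across pipeline cycles.

```python
_vector_cache = {}

def cached_vectorize(evidence_key, compute_vector):
    """Return the vector for evidence_key, computing it only on a cache miss."""
    if evidence_key not in _vector_cache:
        _vector_cache[evidence_key] = compute_vector()
    return _vector_cache[evidence_key]

calls = 0

def expensive_features():
    global calls
    calls += 1
    return [1.0, 0.0, 3.0]

key = ("doc-1", "segment-2", "was_born")        # stable identity of one Evidence
v1 = cached_vectorize(key, expensive_features)
v2 = cached_vectorize(key, expensive_features)  # later cycle: cache hit

print(calls)     # 1 -- features were computed only once
print(v1 is v2)  # True -- the same vector object is reused
```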
