bllip / bllip-parser

This project forked from dmcc/bllip-parser

BLLIP reranking parser (also known as Charniak-Johnson parser, Charniak parser, Brown reranking parser) See http://pypi.python.org/pypi/bllipparser/ for Python module.

Home Page: http://bllip.cs.brown.edu/

natural-language-processing parsing machine-learning nlp nlp-library ai artificial-intelligence computational-linguistics

bllip-parser's Introduction

BLLIP Reranking Parser

We request acknowledgement in any publications that make use of this software and any code derived from this software. Please report the release date of the software that you are using, as this will enable others to compare their results to yours.

Overview

BLLIP Parser is a statistical natural language parser that includes a generative constituent parser (first stage) and a discriminative maximum entropy reranker (second stage). The latest version can be found on GitHub. This document describes how to build and run the reranking parser from the command line. There are also Python and Java interfaces; the Python interface is described in README-python.rst.

Compiling the parser

(optional) For optimal speed, you may want to define $GCCFLAGS specifically for your machine. However, this step can be safely skipped as the defaults are usually fine. With csh or tcsh, try something like:
```
shell> setenv GCCFLAGS "-march=pentium4 -mfpmath=sse -msse2 -mmmx"
```
or:
```
shell> setenv GCCFLAGS "-march=opteron -m64"
```
Build the parser with:
```
shell> make
```
- Sidenote on compiling on OS X
  
  OS X uses the clang compiler by default which cannot currently compile the parser. Try setting this environment variable before building to change the default C++ compiler:
```
shell> setenv CXX g++
```
  Recent versions of OS X may have additional issues. See issues 60, 19, and 13 for more information.

Obtaining parser models

The GitHub repository includes parsing and reranker models, though these are mostly around for historical purposes. See this page on BLLIP Parser models for information about obtaining newer and more accurate parsing models.

Running the parser

After it has been built, the parser can be run with:

shell> parse.sh <sourcefile.txt>

For example:

shell> parse.sh sample-text/sample-data.txt

The input text must already be segmented into sentences, with each sentence wrapped in <s> and </s> tags:

<s> Sentence 1 </s>
<s> Sentence 2 </s>
...

Note that there needs to be a space before and after the sentence.
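
The input format above can be produced with a few lines of Python. This is a minimal sketch, assuming the text has already been split into sentences by some other tool; the function name `to_parser_input` is hypothetical and not part of the parser distribution.

```python
def to_parser_input(sentences):
    """Wrap each sentence in <s> ... </s>, one sentence per line."""
    lines = []
    for sentence in sentences:
        # Collapse newlines and repeated spaces so each sentence stays on one line.
        text = " ".join(sentence.split())
        if not text:
            continue  # skip empty sentences
        # Note the space after <s> and before </s>.
        lines.append("<s> " + text + " </s>")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(to_parser_input(["Sentence 1", "Sentence 2"]), end="")
```

The result can be written to a file and passed to parse.sh in place of sample-text/sample-data.txt.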

The parser distribution currently includes a basic Penn Treebank Wall Street Journal parsing model, which parse.sh uses by default. The Python interface to the parser includes a mechanism for listing and downloading additional parsing models (some of which are more accurate, depending on what you're parsing).

The script parse-and-fuse.sh demonstrates how to run syntactic parse fusion. Fusion can also be run via the Python bindings.

The script parse-eval.sh takes a list of treebank files as arguments and extracts the terminal strings from them, runs the two-stage parser on those terminal strings and then evaluates the parsing accuracy with Sparseval. For example, if the Penn Treebank 3 is installed at /usr/local/data/Penn3/, the following code evaluates the two-stage parser on section 24:

shell> parse-eval.sh /usr/local/data/Penn3/parsed/mrg/wsj/24/wsj*.mrg

The Makefile will attempt to automatically download and build Sparseval for you if you run make sparseval.

For more information on Sparseval see this paper:

@inproceedings{roark2006sparseval,
    title={SParseval: Evaluation metrics for parsing speech},
    author={Roark, Brian and Harper, Mary and Charniak, Eugene and
            Dorr, Bonnie and Johnson, Mark and Kahn, Jeremy G and
            Liu, Yang and Ostendorf, Mari and Hale, John and
            Krasnyanskaya, Anna and others},
    booktitle={Proceedings of LREC},
    year={2006}
}

We no longer distribute evalb with the parser since it sometimes skips sentences unnecessarily. Sparseval does not have these issues.

bllip-parser's Issues

Run first stage code through some code formatter

Formatting should be standardized. Should look at http://stackoverflow.com/questions/841075/best-c-code-formatter-beautifier

Handling quotes

Given the text
John said, "Welcome to the heaven".
rrp.simple_parse gives
(S1 (S (NP (NNP John)) (VP (VBD said) (, ,) (`` ``) (INTJ (UH Welcome) (PP (TO to) (NP (DT the) (NN heaven)))) ('' '')) (. .)))
If I use rrp.parse_tagged with the following tokens and postags

tokens=[u'John', u'said', u',', u'"', u'Welcome', u'to', u'the', u'heaven', u'"', u'.']
postags={0: u'NNP', 1: u'VBD', 2: u',', 3: u'``', 4: u'UH', 5: u'TO', 6: u'DT', 7: u'NN', 8: u"''", 9: u'.'}

it returns an empty list.

Workaround: if I change the opening double quote in tokens to two backticks and the closing double quote to two apostrophes, as in
tokens=[u'John', u'said', u',', u'``', u'Welcome', u'to', u'the', u'heaven', u"''", u'.']
then it works.
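
The workaround can be automated before calling rrp.parse_tagged. The sketch below assumes that straight double quotes strictly alternate between opening and closing within a sentence, which will not hold for unbalanced quotes; `ptb_quotes` is a hypothetical helper name.

```python
def ptb_quotes(tokens):
    """Replace straight double quotes with `` (opening) and '' (closing)."""
    result = []
    inside_quote = False
    for token in tokens:
        if token == '"':
            # Assumption: quotes alternate open/close within one sentence.
            result.append("''" if inside_quote else "``")
            inside_quote = not inside_quote
        else:
            result.append(token)
    return result


if __name__ == "__main__":
    print(ptb_quotes(["John", "said", ",", '"', "Welcome", '"', "."]))
```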

File missing in 20150708 Python release

first-stage/PARSE/SimpleAPI.C is missing in the release.

python import failing with undefined symbol

Hi,

I've installed and re-installed bllipparser a few times on Ubuntu Xenial 16.04 and I keep getting the same import errors in python (anaconda). Has anyone else seen this issue? Here is my attempt at an import:

Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org

>>> import bllipparser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/bllipparser/__init__.py", line 399, in <module>
    from .RerankingParser import RerankingParser, Tree, Sentence, tokenize
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/bllipparser/RerankingParser.py", line 19, in <module>
    from . import CharniakParser as parser
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/bllipparser/CharniakParser.py", line 28, in <module>
    _CharniakParser = swig_import_helper()
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/bllipparser/CharniakParser.py", line 24, in swig_import_helper
    _mod = imp.load_module('_CharniakParser', fp, pathname, description)
ImportError: /home/ubuntu/anaconda2/lib/python2.7/site-packages/bllipparser/_CharniakParser.so: undefined symbol: _ZTVNSt7__cxx1118basic_stringstreamIcSt11char_traitsIcESaIcEEE

Parser does not remain silent when it can not parse a sentence

Sometimes the parser cannot parse a sentence. According to the README, the '-S' argument tells the parser to remain silent when it cannot parse a sentence and simply move on to the next one. With this argument, however, the parser does not go on to the next sentence; it still throws an exception.

My input to the parser is:

./parseIt, -l400, -K, -t4, -S, -EInputTagFile.txt, ../DATA/EN/, InputTextFile.txt

Output of this command (error):

Warning [parseIt.C:266] Sentence 4: Parse failed from 0, inf or nan probabililty -- reparsing without POS constraints
Warning [ChartBase.C:172] Sentence 4: estimating the counts on a zero-probability sentence
parseIt: MeChart.C:105: Bst& MeChart::findMapParse(): Assertion `s' failed.
Aborted (core dumped)

What is wrong with the '-S' argument? Any suggestions?

Update PyPI version

Hello, thanks for the great parser and the interface to Python!

While the GitHub version reflects updates to support Python 3, the version hosted on PyPI still does not. Could this be updated?

Find a way to build JohnsonReranker Python module with optimization enabled

See https://github.com/BLLIP/bllip-parser/blob/master/setup.py#L80 for more information.

parseIt is not threadsafe (was: Probabilities printed out of order for multi-threaded LM parser)

In first-stage/PARSE/parseIt.C, the probabilities are written out directly instead of being pushed to the print stack. This means that they appear out of order in multi-threaded LM parsing.

Difference in the output of bllip-parser and the original Charniak reranking parser

As far as I understand, bllip-parser is the updated version of the original Charniak parser code. However, I noticed differences between the output of the original Charniak reranking parser and that of bllip-parser.

For example, when parsing the sentence "<s> This is some text that we should, presumably, parse! </s>", the Charniak parser labels "parse" as VP whereas bllip-parser labels it as NP.

To me, the Charniak parser output seems more reasonable. Can you please explain where this difference comes from?

Error when using tag() function with the adverb "where"

In [3]: rrp.tag('where')

IndexError Traceback (most recent call last)
/home/DataSet/ in ()
----> 1 rrp.tag('where')

/usr/local/lib/python2.7/dist-packages/bllipparser/RerankingParser.py in tag(self, text_or_tokens)
538 text_or_tokens can be either a string or a sequence of tokens."""
539 parses = self.parse(text_or_tokens)
--> 540 return parses[0].ptb_parse.tokens_and_tags()
541
542 def _find_bad_tag_and_raise_error(self, tags):

IndexError: list index out of range

However,
When I tag a phrase like ('where is'), I got the result:

In [5]: rrp.tag('where is')
Out[5]: [('where', 'WRB'), ('is', 'VBZ')]
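
The traceback shows that tag() indexes parses[0] without checking whether any parse came back. A caller can guard against that case with a small wrapper. This is a sketch only: `safe_tag` is a hypothetical helper, and it relies solely on the parse() method and the ptb_parse.tokens_and_tags() call that appear in the traceback above.

```python
def safe_tag(parser, text_or_tokens):
    """Return (token, tag) pairs, or an empty list when no parse is found."""
    parses = parser.parse(text_or_tokens)
    try:
        best = parses[0]
    except IndexError:
        # The parser produced no parses for this input (as with 'where').
        return []
    return best.ptb_parse.tokens_and_tags()
```

With a loaded parser this would be called as safe_tag(rrp, 'where'). It hides the failure rather than fixing it, so an empty result should still be treated as "could not parse".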

Can I ask how to train this parser?

First, I'm sorry that my English is not very good.
I'm a student in South Korea.
I'm wondering how I can train the parser on new data.
Can I train it on a new corpus? I can't find out how to train on my own data.
Should I do it in the first-stage TRAIN directory?
I'm so curious.

Head node not a direct child

I have the following code:

constituency_string = str(rrp.parse_tagged(tokens, possible_tags=dict(enumerate(postags)))[0].ptb_parse)

tree = Tree(constituency_string)

For the sentence "An interesting date is four days from today.", the expected head (a direct child) and the actual head (pre-terminal) from tree object are depicted below:

(S1										# Expected head: S; Got VBZ

	(S									# Expected head: VP; Got VBZ
		(NP								# Head: NN
			(DT An) 
			(JJ interesting) 
			(NN date)) 
		(VP								# Head: VBZ
			(VBZ is) 
			(NP							# Expected head: NP; Got NNS
				(NP						# Head: NNS
					(CD four)
					(NNS days)) 
				(PP						# Head: IN
					(IN from) 
					(NP					# Head: NN
						(NN today)))))
		(. .)))

I am creating NAF output for the subsequent coreference resolution module. I have written additional code to match the expected results. Is this a bug in bllipparser?

c++ filenames cause difficulty on Mac OS X

Mac OS with the default file system doesn't distinguish foo.c from foo.C and so tries to compile C++ as C
generating error messages like this:

make -C first-stage/PARSE parseIt
g++ -Wall -O3 -fPIC -c Bchart.C
In file included from Term.h:28,
from Edge.h:27,
from ChartBase.h:27,
from Bchart.h:27,
from Bchart.C:24:
ECString.h:1:112: error: algorithm: No such file or directory
ECString.h:2:19: error: cstdlib: No such file or directory
ECString.h:3:19: error: cstring: No such file or directory
ECString.h:12:18: error: string: No such file or directory

http://stackoverflow.com/questions/10860882/cannot-include-standard-c-libraries-with-mingw

About Training Corpus

trainParser -parser [data directory] [training corpus] [development corpus]

If I train on my own data, I need two corpora: a training corpus and a development corpus, but I don't really understand what the development corpus is.
Must the [data directory] contain featInfo.*, bugFix.txt, headInfo.txt, and terms.txt (and the training and development corpora)?
Can I add the parsing model trained on my data to the original parsing model?

Facing WindowsError: [Error 32] error while running sd.sd.convert_tree

The error looks like this :
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\users\nasdnjdn\appdata\local\temp\tmp6xftke'

Please help with a fix

parseIt should error if it doesn't get SGML-ish input

If the input text doesn't start with <s it is silently ignored. This should be an error.

parseIt: MeChart.C:97: Bst& MeChart::findMapParse(): Assertion `s' failed.

Continued from #17 (comment)

@gvjoshi25, I can't replicate this issue on Ubuntu or RHEL machines. Just to confirm, you're using an unmodified version of the latest parser and get the same assertion error every time?

If so, what *NIX distribution and version are you using? Also, can you include the output of these commands:

uname -a
gcc --version

It might also be interesting to see the output of the parser with the debugging flag on (-d100) but note that this produces a lot of output (2.4MB for me).

Reranker doesn't work under g++ (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1

Reported by Faisal Mahbub Chowdhury. This is the g++ included in Ubuntu 11.10.

Error message:

heads::syntactic_data::headchild() Error: can't find entry for category %UNDEFINED%
(best-parses: sym.h:129: const char* symbol::c_str() const: Assertion `is_defined()' failed.
Aborted

Model files shouldn't be distributed in the repository

Instead, the Makefile should download them from a website. It should run a script which would give you options for downloading the WSJ English model, the self-trained newswire model, the self-trained biomedical model, and possibly the best AnyDomain parsing models. It should handle reranker models as well.

Problem with conjoined NP's

I have found an issue with your biomedical model. It seems that the biomedical model tends to deepen NPs and ADJPs unnecessarily. Consider the following sentence:

Xa and Yb proteins were found .

Using the biomedical model I get a syntax tree with an extra layer of NPs in the subtree: (NP (NP (NN Xa)) (CC and) (NP (NN Yb) (NNS proteins)))

As a result of this, when I feed this parse to Stanford dependency parser, I do not get the correct dependency graph (missing 'nn' dependency relation between Xa and proteins). However, if I remove the extra layer of NP's then the dependency graph becomes correct.

The WSJ model does not have this issue. It comes up with the correct parse tree.

Please let me know if I am mistaken or am doing something wrong that is causing the problem.
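
Removing the extra NP layer, as described above, can be done as a post-processing step on the bracketed parse. The sketch below is a blunt heuristic and not part of the parser: it splices every NP child into an NP parent that also has a CC child, which would also flatten coordinations where the nested NPs are correct. All function names are hypothetical.

```python
import re


def read_tree(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def node():
        nonlocal position
        assert tokens[position] == "("
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(node())
            else:
                children.append(tokens[position])  # a word under a POS tag
                position += 1
        position += 1  # consume ")"
        return [label] + children

    return node()


def write_tree(tree):
    """Serialize nested lists back into a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(write_tree(part) for part in tree) + ")"


def flatten_coordinated_np(tree):
    """Splice NP children into an NP parent that also has a CC child."""
    if isinstance(tree, str):
        return tree
    label = tree[0]
    children = [flatten_coordinated_np(child) for child in tree[1:]]
    has_cc = any(isinstance(c, list) and c[0] == "CC" for c in children)
    if label == "NP" and has_cc:
        spliced = []
        for child in children:
            if isinstance(child, list) and child[0] == "NP":
                spliced.extend(child[1:])  # drop the extra NP layer
            else:
                spliced.append(child)
        children = spliced
    return [label] + children


if __name__ == "__main__":
    parse = "(NP (NP (NN Xa)) (CC and) (NP (NN Yb) (NNS proteins)))"
    print(write_tree(flatten_coordinated_np(read_tree(parse))))
```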

can't find entry for category

I first set GCCFLAGS with the command "export GCCFLAGS="-march=core2 -mfpmath=sse -msse2 -mmmx"".
Then I ran the command "./parse.sh sample-data.txt".
It shows some errors. Can you tell me why?
The error output is below.
heads::syntactic_data::headchild() Error: can't find entry for category %UNDEFINED%
(best-parses: sym.h:129: const char* symbol::c_str() const: Assertion `is_defined()' failed.
Aborted (core dumped)

Don't know how to compile first-stage/PARSE/swig/wrapper.C [Microsoft Visual C Compiler]

When I try to build it on Windows 7 with Python 3.5 (from Anaconda) and swig-3.0.8, I get the error below. Any idea about it?
Thanks!

Installing collected packages: bllipparser
Running setup.py install for bllipparser ... error
Complete output from command C:\Users\XXX\AppData\Local\Continuum\Anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\XXX\AppData\Local\Temp\pip-build-xk8jp9i2\bllipparser\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record C:\Users\XXX\AppData\Local\Temp\pip-kqz1g4fw-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.5
creating build\lib.win-amd64-3.5\bllipparser
copying python\bllipparser\CharniakParser.py -> build\lib.win-amd64-3.5\bllipparser
copying python\bllipparser\JohnsonReranker.py -> build\lib.win-amd64-3.5\bllipparser
copying python\bllipparser\ModelFetcher.py -> build\lib.win-amd64-3.5\bllipparser
copying python\bllipparser\ParsingShell.py -> build\lib.win-amd64-3.5\bllipparser
copying python\bllipparser\RerankerFeatureCorpus.py -> build\lib.win-amd64-3.5\bllipparser
copying python\bllipparser\RerankingParser.py -> build\lib.win-amd64-3.5\bllipparser
copying python\bllipparser\Utility.py -> build\lib.win-amd64-3.5\bllipparser
copying python\bllipparser\__init__.py -> build\lib.win-amd64-3.5\bllipparser
copying python\bllipparser\__main__.py -> build\lib.win-amd64-3.5\bllipparser
running build_ext
building 'bllipparser._CharniakParser' extension
error: Don't know how to compile first-stage/PARSE/swig/wrapper.C

g++ compilation error on gcc version 4.5.2

I am getting this error on: [Linux ubuntu 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:50 UTC 2011 i686 i686 i386 GNU/Linux] with [gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4) ]

tree.h: In constructor ‘tree_node<label_type>::tree_node(const tree_label::cat_type&, tree_node<label_type>, tree_node<label_type>)’:
tree.h:319:39: error: cannot call constructor ‘tree_label::tree_label’ directly
tree.h:319:39: error: for a function-style cast, remove the redundant ‘::tree_label’

Create and distribute Debian/Ubuntu/RHEL packages

Large output dump to stderr when parsing a "word" with a space in it.

If I parse a sentence that has a space in the word (e.g. "tea pot" from below), I get a huge dump to stderr of the form "## ". It looks like I'm getting roughly 5k lines. I still end up with an acceptable parse. I'd like to be able to turn off this output dump.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bllipparser import RerankingParser
>>> rp = RerankingParser.fetch_and_load('WSJ-PTB3')
rp.parser(">>> 
>>> rp.parse(["I", "am", "a", "teapot"])[0]
ScoredParse('(S1 (S (NP (PRP I)) (VP (VBP am) (NP (DT a) (NN teapot)))))', parser_score=-56.08248356757865, reranker_score=-14.256298669561028)
>>> rp.parse(["I", "am", "a", "tea pot"])[0]
## preterms = ((PRP i) (VBP am) (DT a) (NN tea pot))
## tp = (S1 (S (NP (PRP i)) (VP (VBP am) (NP (DT a) (NN tea pot)))))
## preterms = ((PRP i) (VBP am) (DT a) (JJ tea pot))

... <snip about 5k lines> ...

## tp = (S1 (S (NP (PRP i)) (VP (VBP am) (SBAR (S (NP (NP (DT a)) (NNS tea pot)))))))
## preterms = ((PRP i) (VBP am) (DT a) (JJ tea pot))
## tp = (S1 (S (NP (NP (PRP i)) (VP (VBP am) (NP (DT a) (JJ tea pot))))))
ScoredParse('(S1 (S (NP (PRP I)) (VP (VBP am) (NP (DT a) (NN tea pot)))))', parser_score=-56.08248356757865, reranker_score=-16.96028206416102)

I realize that I'm doing some things that the parser wasn't intended to do, but I'm working with non-clean data, so I'm checking how it behaves if we get something unexpected.
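
One way to keep such input away from the parser is to normalize the token list first. The sketch below splits tokens on internal whitespace and drops tokens that are empty or whitespace-only; `clean_tokens` is a hypothetical helper, and splitting is only one possible policy (joining the pieces with an underscore would be another).

```python
def clean_tokens(tokens):
    """Split tokens on internal whitespace and drop empty ones."""
    cleaned = []
    for token in tokens:
        # str.split() with no argument discards leading, trailing, and
        # repeated whitespace, so " " contributes nothing at all.
        cleaned.extend(token.split())
    return cleaned


if __name__ == "__main__":
    print(clean_tokens(["I", "am", "a", "tea pot"]))
```

The cleaned list no longer lines up index-for-index with the original tokens, which matters if tag constraints are keyed by token position.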

MacOS: Compiled successfully, but module is missing

I'm on MacOS 10.12.3 and followed the hints in the other issue on failing MacOS compilation (#19). I was finally able to get the compilation and the Python package setup to run through, but I still cannot use the parser.

fxa:~ felix$ python3
Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import bllipparser
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bllipparser-2016.9.11-py3.6-macosx-10.6-intel.egg/bllipparser/CharniakParser.py", line 14, in swig_import_helper
    return importlib.import_module(mname)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 648, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 560, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 922, in create_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bllipparser-2016.9.11-py3.6-macosx-10.6-intel.egg/bllipparser/_CharniakParser.cpython-36m-darwin.so, 2): no suitable image found.  Did find:
	/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bllipparser-2016.9.11-py3.6-macosx-10.6-intel.egg/bllipparser/_CharniakParser.cpython-36m-darwin.so: mach-o, but wrong architecture

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bllipparser-2016.9.11-py3.6-macosx-10.6-intel.egg/bllipparser/__init__.py", line 399, in <module>
    from .RerankingParser import RerankingParser, Tree, Sentence, tokenize
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bllipparser-2016.9.11-py3.6-macosx-10.6-intel.egg/bllipparser/RerankingParser.py", line 19, in <module>
    from . import CharniakParser as parser
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bllipparser-2016.9.11-py3.6-macosx-10.6-intel.egg/bllipparser/CharniakParser.py", line 17, in <module>
    _CharniakParser = swig_import_helper()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bllipparser-2016.9.11-py3.6-macosx-10.6-intel.egg/bllipparser/CharniakParser.py", line 16, in swig_import_helper
    return importlib.import_module('_CharniakParser')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ModuleNotFoundError: No module named '_CharniakParser'
>>>

terminate called after throwing an instance of 'swig::stop_iteration'

I ran into the following issue when trying to run the parser. Any idea what might cause this?

>>> from bllipparser import RerankingParser
>>> rrp = RerankingParser.fetch_and_load('WSJ-PTB3', verbose=True)
>>> rrp.simple_parse("It's that easy.")
terminate called after throwing an instance of 'swig::stop_iteration'
Aborted

Thanks!

parseIt exits when it sees an empty sentence

If you give parseIt this input file:

 <s> </s>
 <s> Non-empty </s>

it will exit before parsing "Non-empty".

(thanks to Joel Tetrault for the bug report)
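
Until parseIt handles these cases itself, the input file can be checked before parsing. The sketch below flags lines that are not wrapped in "<s> ... </s>" and lines whose sentence is empty; `split_valid_lines` is a hypothetical helper and it assumes the plain "<s> " and " </s>" delimiters shown in the README, with no attributes on the tag.

```python
def split_valid_lines(lines):
    """Return (good_lines, problems) for a list of input lines."""
    good, problems = [], []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no sentence
        if not (stripped.startswith("<s> ") and stripped.endswith(" </s>")):
            problems.append((number, "not wrapped in <s> ... </s>"))
        elif not stripped[3:-4].strip():
            # Only whitespace between the tags, e.g. "<s> </s>".
            problems.append((number, "empty sentence"))
        else:
            good.append(stripped)
    return good, problems


if __name__ == "__main__":
    good, problems = split_valid_lines([" <s> </s>", " <s> Non-empty </s>"])
    print(good)
    print(problems)
```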

bllipparser should provide an n-best list reader

Should be able to read first and second stage n-best lists.

Invalid trees break tree reader internal state

Example:

>>> from bllipparser import Tree
>>> s = '(S1 (NNS Markets) (: --))' # valid tree

>>> t1 = Tree('(())') # invalid tree
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/bllipparser/RerankingParser.py", line 63, in __init__
    parser.inputTreeFromString(input_tree_or_string)
RuntimeError: [first-stage/PARSE/utils.C:52]: Saw paren rather than term

>>> Tree(s) # crashes 1st time
Saw )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/bllipparser/RerankingParser.py", line 63, in __init__
    parser.inputTreeFromString(input_tree_or_string)
RuntimeError: [first-stage/PARSE/utils.C:52]: Should have seen an open paren here.

>>> Tree(s) # succeeds 2nd time
Tree('(S1 (NNS Markets) (: --))')

(Thanks @cdg720 for catching this!)

Segmentation fault when parsing non-clean sentence

I get a segmentation fault when parsing ["I", "am", "a", "little", "teapot", ".", " ", "What", "?"] using WSJ-PTB3. See the repl paste below.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bllipparser
>>> bllipparser.__version__
'2016.9.11'
>>> from bllipparser import RerankingParser
>>> rp = RerankingParser.fetch_and_load('WSJ-PTB3', verbose=True)
Model directory: /home/halfak/.local/share/bllipparser/WSJ-PTB3
Model directory already exists, not reinstalling
>>> rp.parse(["I", "am", "a", "little", "teapot"])[0]
ScoredParse('(S1 (S (NP (PRP I)) (VP (VBP am) (NP (DT a) (JJ little) (NN teapot)))))', parser_score=-64.30434900543281, reranker_score=-16.740114175058775)
>>> rp.parse(["I", "am", "a", "little", "teapot", ".", " ", "What", "?"])[0]
## preterms = ((PRP i) (VBP am) (DT a) (JJ little) (NN teapot) (. .) (VP vbz (NP (WP what))) (. ?))
## tp = (S1 (S (S (NP (PRP i)) (VP (VBP am) (NP (DT a) (JJ little) (NN teapot))) (. .)) (VP vbz (NP (WP what))) (. ?)))
Segmentation fault (core dumped)

I'm running Ubuntu 16.04 64bit.

$ uname -a
Linux graphite 4.4.0-36-generic #55-Ubuntu SMP Thu Aug 11 18:01:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I found this in /var/log/syslog:

Sep 13 20:24:13 graphite kernel: [625702.799534] python[9585]: segfault at 40 ip 00007f87e0ff21c0 sp 00007fff6cf55e20 error 4 in _JohnsonReranker.cpython-35m-x86_64-linux-gnu.so[7f87e0fa9000+92000]

Makefile error on SunOS

I got this error on [SunOS charlie 5.10 Generic_144488-06 sun4u sparc SUNW,Sun-Fire-280R]:

Fatal error in reader: Makefile, line 179: Unexpected end of line seen

Add and package higher-level Java interface

Python has a higher-level interface to SWIG, but the current repository doesn't include one for Java. I've written one for Java. However, it is not well packaged and thus hadn't been checked in previously.

From what I can tell, the best way to distribute Java native code is as a .nar file (http://maven-nar.github.io/), so I hope to move towards that type of system. If you're a Maven/Java packaging expert, please feel free to get involved!

Python 3.5 support

When I try to build it with Python 3.5 I get the following error

Collecting bllipparser
  Using cached bllipparser-2015.12.3.tar.gz
Requirement already satisfied (use --upgrade to upgrade): six in /usr/lib/python3.5/site-packages (from bllipparser)
Installing collected packages: bllipparser
  Running setup.py install for bllipparser ... error
    Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-84szdo98/bllipparser/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-hkxweaja-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.5
    creating build/lib.linux-x86_64-3.5/bllipparser
    copying python/bllipparser/CharniakParser.py -> build/lib.linux-x86_64-3.5/bllipparser
    copying python/bllipparser/ModelFetcher.py -> build/lib.linux-x86_64-3.5/bllipparser
    copying python/bllipparser/RerankingParser.py -> build/lib.linux-x86_64-3.5/bllipparser
    copying python/bllipparser/ParsingShell.py -> build/lib.linux-x86_64-3.5/bllipparser
    copying python/bllipparser/__init__.py -> build/lib.linux-x86_64-3.5/bllipparser
    copying python/bllipparser/JohnsonReranker.py -> build/lib.linux-x86_64-3.5/bllipparser
    copying python/bllipparser/Utility.py -> build/lib.linux-x86_64-3.5/bllipparser
    copying python/bllipparser/RerankerFeatureCorpus.py -> build/lib.linux-x86_64-3.5/bllipparser
    copying python/bllipparser/__main__.py -> build/lib.linux-x86_64-3.5/bllipparser
    running build_ext
    building 'bllipparser._CharniakParser' extension
    creating build/temp.linux-x86_64-3.5
    creating build/temp.linux-x86_64-3.5/first-stage
    creating build/temp.linux-x86_64-3.5/first-stage/PARSE
    creating build/temp.linux-x86_64-3.5/first-stage/PARSE/swig
    gcc -pthread -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fPIC -Ifirst-stage/PARSE/ -I/usr/include/python3.5m -c first-stage/PARSE/swig/wrapper.C -o build/temp.linux-x86_64-3.5/first-stage/PARSE/swig/wrapper.o
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    first-stage/PARSE/swig/wrapper.C:30900:1: error: too many initializers for ‘PyAsyncMethods’
     };
     ^
    ... more errors ensue....

I've never touched swig in my life, but looking at this it seems that wrapper.C is not being regenerated properly.

Parser's behavior in absence of -K

When the parser reads from stdin, it behaves strangely if -K is absent. It does not immediately produce output for the first sentence; it waits until the next sentence is given and only then outputs the parse for the previous one. This was a problem for me since I was feeding it through a pipe. I found a silly workaround, though. Instead of giving it an input like .. \n I give ~~...~~ \n.\n and it works.

re-train parser get segmentation fault error "/bin/sh: line 1: 64668 Segmentation fault second-stage/programs/prepare-data/ptb -n 20 -x 00 -e"

Hi,
Could somebody help me with this error? I'm running the parser on 64-bit Red Hat 5. When training the parser with a large dataset, similar to WSJ, I get this segmentation fault from ptb. I'm using 8 threads.

"/bin/sh: line 1: 64668 Segmentation fault second-stage/programs/prepare-data/ptb -n 20 -x 00 -e ....."

Thanks,

bllipparser.Tree should allow access to the head finding information

This is probably the only remaining feature from PyInputTree that's not in the Python version of BLLIP Parser.

Invalid trees can make tree reader spin

For example:

>>> from bllipparser import Tree
>>> Tree('(S1 (NP')
[never returns]
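
A cheap bracket check before calling Tree() catches truncated input like the example above. This is a sketch with a hypothetical helper name, `check_bracketing`; passing the check does not guarantee that the tree reader will accept the string, it only rules out unbalanced parentheses and empty "()" nodes.

```python
def check_bracketing(tree_string):
    """Return None if the brackets look sane, else a short error message."""
    depth = 0
    previous = ""  # last non-whitespace character seen
    seen_open = False
    for char in tree_string:
        if char == "(":
            depth += 1
            seen_open = True
        elif char == ")":
            if previous == "(":
                return "empty node '()'"
            depth -= 1
            if depth < 0:
                return "unmatched ')'"
        if not char.isspace():
            previous = char
    if not seen_open:
        return "no opening paren"
    if depth:
        return "missing %d closing paren(s)" % depth
    return None


if __name__ == "__main__":
    print(check_bracketing("(S1 (NP"))
```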

bllip parser in Server

Please let me know if bllip parser can be used in this fashion. I am using Flask.

During start-up:
rrp = RerankingParser.fetch_and_load('WSJ-PTB3', verbose=True)

While processing (multiple concurrent) requests from clients (Same object rrp is reused):
constituency_lisp_string = str(rrp.parse_tagged(tokens, possible_tags=dict(enumerate(postags)))[0].ptb_parse)
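
Whether a single RerankingParser object may be called from several threads at once is not stated here, and another report in this list notes that parseIt is not threadsafe. A conservative sketch, assuming nothing about the bindings, is to serialize access with a lock; `SerializedParser` is a hypothetical wrapper, not part of bllipparser.

```python
import threading


class SerializedParser:
    """Let many threads share one parser by running one call at a time."""

    def __init__(self, parser):
        self._parser = parser
        self._lock = threading.Lock()

    def parse_tagged(self, tokens, possible_tags):
        # Only one request thread may be inside the parser at any moment.
        with self._lock:
            return self._parser.parse_tagged(tokens, possible_tags=possible_tags)
```

At start-up the loaded rrp object would be wrapped once, and request handlers would call the wrapper instead of rrp directly. This trades throughput for safety; running one parser per worker process is another option.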

Biomedical named entities being treated as Cardinals

I am using the GENIA+PubMed model and parsing biomedical text.

A very frequent issue I have observed is that certain biomedical entities such as microRNAs (miRs) are being tagged as CD rather than NN.

Example:
Sentence: (named entities highlighted in bold)
Human micro-RNAs miR-223, miR-26b, miR-221, miR-103-1, miR-185, miR-23b,miR-203, miR-17-5p, miR-23a, and miR-205 were significantly up-regulated in bladder cancers.

(S1 (S (S (NP (NP (NP (JJ Human) (NNS micro-RNAs)) (QP (CD miR-223) (, ,) (CD miR-26b) (, ,) (CD miR-221) (, ,) (CD miR-103-1) (, ,) (CD miR-185) (, ,) (CD miR-23b) (, ,) (CD miR-203))) (, ,) (NP (NP (NN miR-17-5p)) (, ,) (NP (NN miR-23a)) (, ,) (CC and) (NP (NN miR-205)))) (VP (VBD were) (ADVP (RB significantly)) (VP (VBN up-regulated) (PP (IN in) (NP (JJ bladder) (NNS cancers)))))) (. .)))

I have a regular expression for identifying such miR named entities. Is there a way to force these entities to be recognized as NN and thus as part of an NP instead of a QP?

Note: Since I am using these parse trees generated by BLLIP to get Universal Dependencies with the Stanford Typed Dependency Converter, dependencies between these entities (incorrectly tagged as CDs) are being incorrectly identified.
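
One possible approach is to build tag constraints from the regular expression and pass them through the possible_tags argument of parse_tagged, which other reports in this list call with a dictionary from token index to tag. The sketch below only builds that dictionary. The pattern `MIR_PATTERN` and the helper `noun_constraints` are hypothetical, and it is an assumption that tokens left out of the dictionary remain unconstrained.

```python
import re

# Hypothetical pattern for names such as miR-223, miR-26b, miR-103-1, miR-17-5p.
MIR_PATTERN = re.compile(r"^miR-\d+[a-z]?(?:-\d+p?)*$")


def noun_constraints(tokens, pattern=MIR_PATTERN, tag="NN"):
    """Map the index of every token matching the pattern to the given tag."""
    return {
        index: tag
        for index, token in enumerate(tokens)
        if pattern.match(token)
    }


if __name__ == "__main__":
    tokens = ["Human", "micro-RNAs", "miR-223", ",", "miR-26b", "were", "found"]
    print(noun_constraints(tokens))
```

The result would then be used as rrp.parse_tagged(tokens, possible_tags=noun_constraints(tokens)).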

basic usage of bllip-parser

Hello, I am new here. This is my first time using this parser, and it is really cool software.
It is still unfamiliar to me, though, and I need some help. Many thanks.

First:
If you want to make it slightly easier for humans to read, use the
command line argument -P (pretty print),

I tried "./parse.sh -P <sourcefile.txt>", but it didn't work. What is the correct command?

Second:
If I want the stdout to be stored in a .txt file, what should I do?

Thanks for your time!
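
On the first point, one possible cause (an assumption, since the error message is not shown) is that the angle brackets were typed literally, in which case the shell treats `<` and `>` as redirection operators rather than as part of the file name. On the second point, stdout can be redirected in the shell, or captured from a script. The sketch below shows the script variant; `run_to_file` is a hypothetical helper and works for any command.

```python
import subprocess


def run_to_file(command, output_path):
    """Run a command given as a list of arguments and write its stdout to a file."""
    with open(output_path, 'w') as out:
        completed = subprocess.run(command, stdout=out)
    return completed.returncode
```

A call might look like `run_to_file(['./parse.sh', '-P', 'sourcefile.txt'], 'parsed.txt')`, where the argument order for `parse.sh` is copied from the question and may need a model directory or other arguments depending on the parser version.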

Compilation issue under OSX

When attempting to compile BLLIP through TEES, and later directly, I got the following problems:

-- first, I get an "invalid value '6' in '-O6'" error when best-parses is compiled ('make -C second-stage/programs/features best-parses'). It seems that the OSX clang compiler (version 4.2 on my 10.7.5 box) doesn't accept optimization levels that high. Setting it to 3 appears to fix this problem. Note: the compilation of parseIt ('make -C first-stage/PARSE parseIt') uses -O3 instead of -O6; I am not sure why.

-- second, once the optimization flag is changed, I get the following compilation errors: http://pastebin.com/5CiGgs5Q

Any help would be appreciated.

Thanks,
Aurélien

Can't pickle RerankingParser: attribute lookup RerankerModel on importlib._bootstrap failed

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bllipparser import RerankingParser
>>> bllip_rrp = RerankingParser.fetch_and_load('WSJ-PTB3')
>>> import pickle
>>> len(pickle.dumps(bllip_rrp))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <class 'JohnsonReranker.RerankerModel'>: attribute lookup RerankerModel on importlib._bootstrap failed
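
If the goal of pickling is to share the parser between processes (an assumption; the report does not say why the object is being pickled), one workaround is to pass only the model name across the process boundary and load the parser once in each process. This is a minimal sketch; `get_parser`, `_PARSERS`, and `fake_loader` are hypothetical names, and in real use the loader would be `RerankingParser.fetch_and_load`.

```python
_PARSERS = {}  # per-process cache: model name -> loaded parser


def get_parser(model_name, loader):
    """Load the parser for model_name once per process and reuse it afterwards."""
    if model_name not in _PARSERS:
        _PARSERS[model_name] = loader(model_name)
    return _PARSERS[model_name]


calls = []


def fake_loader(name):
    """Stand-in for RerankingParser.fetch_and_load so the sketch runs anywhere."""
    calls.append(name)
    return object()


first = get_parser('WSJ-PTB3', fake_loader)
second = get_parser('WSJ-PTB3', fake_loader)
```

The model name is a plain string, so it pickles without trouble, and each worker pays the loading cost only on its first call.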

Will not compile on windows

It looks like the installer does not compile bllipparser properly on Windows.

Specs:
win7 64
python 2.7
latest VCForPython27.msi (as of 03/13/2017)

Console:

> pip install bllipparser
Collecting bllipparser
  Using cached bllipparser-2016.9.11.tar.gz
Requirement already satisfied: six in c:\python27\lib\site-packages (from bllipparser)
Installing collected packages: bllipparser
  Running setup.py install for bllipparser ... error
    Complete output from command c:\python27\python.exe -u -c "import setuptools, tokenize;__file__='c:\\users\\me\\appdata\\local\\temp\\pip-build-olmpji\\bllipparser\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record c:\users\me\appdata\local\temp\pip-ftrhgb-record\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-2.7
    creating build\lib.win-amd64-2.7\bllipparser
    copying python\bllipparser\CharniakParser.py -> build\lib.win-amd64-2.7\bllipparser
    copying python\bllipparser\JohnsonReranker.py -> build\lib.win-amd64-2.7\bllipparser
    copying python\bllipparser\ModelFetcher.py -> build\lib.win-amd64-2.7\bllipparser
    copying python\bllipparser\ParsingShell.py -> build\lib.win-amd64-2.7\bllipparser
    copying python\bllipparser\RerankerFeatureCorpus.py -> build\lib.win-amd64-2.7\bllipparser
    copying python\bllipparser\RerankingParser.py -> build\lib.win-amd64-2.7\bllipparser
    copying python\bllipparser\Utility.py -> build\lib.win-amd64-2.7\bllipparser
    copying python\bllipparser\__init__.py -> build\lib.win-amd64-2.7\bllipparser
    copying python\bllipparser\__main__.py -> build\lib.win-amd64-2.7\bllipparser
    running build_ext
    building 'bllipparser._CharniakParser' extension
    error: Don't know how to compile first-stage/PARSE/swig/wrapper.C

    ----------------------------------------
Command "c:\python27\python.exe -u -c "import setuptools, tokenize;__file__='c:\\users\\me\\appdata\\local\\temp\\pip-build-olmpji\\bllipparser\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record c:\users\me\appdata\local\temp\pip-ftrhgb-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\me\appdata\local\temp\pip-build-olmpji\bllipparser\

Tokenizer doesn't segment double/triple hyphens correctly

a--b should be a, --, b. Currently it is not split up by the tokenizer.
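
As a stopgap, input can be pre-split before it reaches the parser. The sketch below illustrates the expected behaviour with a regular expression; `split_hyphen_runs` is a hypothetical helper and not the parser's own tokenizer.

```python
import re

# A run of two or more hyphens; the capture group keeps the run in the output.
_HYPHEN_RUN = re.compile(r'(-{2,})')


def split_hyphen_runs(token):
    """Split a token around runs of two or more hyphens, keeping the runs."""
    return [part for part in _HYPHEN_RUN.split(token) if part]
```

Single hyphens are left alone, so a token such as `up-regulated` stays in one piece.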

Persian Language parser and reranker

Is it possible to build a Persian language parser and reranker based on bllip-parser?
Assuming I have a Persian treebank, what do I have to do?

Maximum sentence length is not really the maximum sentence length.

It seems that there are (at least) two off-by-one errors with these calculations:

shell% ./parseIt -l399 ../DATA/EN 398.sgml
<doesn't crash, gives dummy parse>
shell% ./parseIt -l399 ../DATA/EN 399.sgml
parseIt: GotIter.C:73: void LeftRightGotIter::makelrgi(Edge*): Assertion `i < 400' failed.
<segfaults>
shell% ./parseIt -l399 ../DATA/EN 400.sgml
<doesn't crash, sentence is "skipped" and dummy parse is printed instead>

The obvious workaround is to only parse sentences that are at least two tokens shorter than the maximum sentence length (unlikely to be much of an issue in practice).
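
The workaround can be sketched as a filter applied before parsing. The default `hard_limit` of 400 is taken from the assertion `i < 400` in the output above, and the margin of 2 matches the observation that 398 tokens parse without a crash while 399 do not; `split_by_length` is a hypothetical helper.

```python
def split_by_length(sentences, hard_limit=400, margin=2):
    """Separate token lists that are safe to parse from those that are too long."""
    max_safe = hard_limit - margin
    safe, too_long = [], []
    for tokens in sentences:
        if len(tokens) <= max_safe:
            safe.append(tokens)
        else:
            too_long.append(tokens)
    return safe, too_long
```

Sentences in `too_long` could be skipped or split further before being sent to `parseIt`.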

About Training Data

May I ask another question?

I want to keep the existing training data and add my own new training data to it. Is that possible?
In other words, I want a model trained on both the original data and my new data.
If I train on my own data, is the original data overwritten, or does it remain?
I saw the DATA directory in first-stage, and the data files there contain numbers.
Do I need to run some program to produce data in that form? Can you tell me how to get training data into that format?
Thank you for reading!

Compilation fails under OS X 10.9 (Mavericks)

While I have had the same problem that is described here, the suggested steps only brought me so far. The compiler doesn't find <ext/stdio_filebuf.h>, which, I'm sure, is because of changes by Apple in 10.9. Here's the relevant part of the output:

g++ -MMD -O3 -Wall -ffast-math -finline-functions -fomit-frame-pointer -fno-strict-aliasing -fPIC  Bchart.o BchartSm.o Bst.o FBinaryArray.o CntxArray.o ChartBase.o ClassRule.o ECArgs.o Edge.o EdgeHeap.o ExtPos.o Feat.o Feature.o FeatureTree.o Field.o FullHist.o GotIter.o InputTree.o Item.o Link.o Params.o ParseStats.o SentRep.o ScoreTree.o Term.o TimeIt.o UnitRules.o ValHeap.o edgeSubFns.o ewDciTokStrm.o extraMain.o fhSubFns.o headFinder.o headFinderCh.o utils.o MeChart.o  parseIt.o -o parseIt -D_REENTRANT -D_XOPEN_SOURCE=600 -lpthread
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C second-stage/programs/features best-parses
g++ -MMD -O3 -Wall -ffast-math -finline-functions -fomit-frame-pointer -fno-strict-aliasing -fPIC  -Wno-deprecated   -c -o best-parses.o best-parses.cc
In file included from best-parses.cc:50:
./popen.h:25:10: fatal error: 'ext/stdio_filebuf.h' file not found
#include <ext/stdio_filebuf.h>
         ^
1 error generated.
make[1]: *** [best-parses.o] Error 1
make: *** [reranker-runtime] Error 2

Python 3 version in the works?

Hi, are you planning a Python 3 version? It shouldn't be too difficult to port: basically fixing prints, replacing "file" with "open", and mainly fixing iterable vs. list indexing. Thanks

Carlos
