parserator

A toolkit for making domain-specific probabilistic parsers

Do you have domain-specific text data that would be much more useful if you could derive structure from the strings? This toolkit will help you create a custom NLP model that learns from patterns in real data and then uses that knowledge to process new strings automatically. All you need is some training data to teach your parser about its domain.

What does a probabilistic parser do?

Given a string, a probabilistic parser will break it out into labeled components. The parser uses conditional random fields to label components based on (1) features of the component string and (2) the order of labels.
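
For example, here is the kind of output usaddress (a parser built with parserator; see below) produces; the labels come from its tag set:

>>> import usaddress
>>> usaddress.parse('123 Main St. Suite 100 Chicago, IL')
[('123', 'AddressNumber'), ('Main', 'StreetName'), ('St.', 'StreetNamePostType'), ('Suite', 'OccupancyType'), ('100', 'OccupancyIdentifier'), ('Chicago,', 'PlaceName'), ('IL', 'StateName')]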

When is a probabilistic parser useful?

A probabilistic parser is particularly useful for sets of strings that may have common structure/patterns, but which deviate from those patterns in ways that are difficult to anticipate with hard-coded rules.

For example, in most cases, addresses in the United States start with a street number. But there are exceptions: sometimes valid U.S. addresses deviate from this pattern (e.g., addresses starting with a building name or a P.O. box). Furthermore, addresses in real data sets often include typos and other errors. Because there are infinitely many patterns and possible typos to account for, a probabilistic parser is well-suited to parse U.S. addresses.

With a probabilistic (as opposed to a rule-based) approach, the parser can continually learn from new training data and thus continually improve its performance!

Some other examples of domains where a probabilistic parser can be useful:

  • addresses in other countries with unfamiliar conventions
  • product names/descriptions (e.g., parsing phrases like "Twizzlers Twists, Strawberry, 16-Ounce Bags (Pack of 6)" into brand, item, flavor, weight, etc.)
  • citations in academic writing

Examples of parserator

  • usaddress, a library for parsing unstructured United States address strings into address components
  • probablepeople, a library for parsing unstructured romanized name or company strings into components

Try out these parsers on our web interface!

How to make a parser - quick overview

For more details on each step, see the parserator documentation.

  1. Initialize a new parser

    pip install parserator
    parserator init [YOUR PARSER NAME]
    python setup.py develop
    
  2. Configure the parser to your domain (a code sketch illustrating steps 2 and 3 follows this list)

    • configure labels (i.e., the set of possible tags for the tokens)
    • configure the tokenizer (i.e., how a raw string will be split into a sequence of tokens to be tagged)
  3. Define features relevant to your domain

    • define token-level features (e.g., length, casing)
    • define sequence-level features (e.g., whether a token is the first token in the sequence)
  4. Prepare training data

    • Parserator reads training data in XML format
    • To create XML training data from unlabeled strings in a CSV file, use parserator's command line interface to manually label tokens. It reads the values in the first column and ignores the other columns. To start labeling, run parserator label [infile] [outfile] [modulename]
    • For example, parserator label unlabeled/rawstrings.csv labeled_xml/labeled.xml usaddress
  5. Train your parser

    • To train your parser on your labeled training data, run parserator train [traindata] [modulename]
    • For example, parserator train labeled_xml/labeled.xml usaddress or parserator train "labeled_xml/*.xml" usaddress
    • After training, your parser will have an updated model, in the form of a .crfsuite settings file
  6. Repeat steps 3-5 as needed!
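
As a rough illustration of steps 2 and 3, here is a hedged sketch of the kinds of edits made to the generated module. The names LABELS, tokenize, and tokens2features follow parserator's template, but the specific labels and features below are illustrative (borrowing the product-description example above), not a definitive implementation:

    import re

    # step 2: the set of possible tags for tokens (illustrative labels)
    LABELS = ['Brand', 'Item', 'Flavor', 'Weight']

    def tokenize(raw_string):
        # step 2: split a raw string into a sequence of tokens to be tagged
        return re.findall(r'\S+', raw_string)

    def tokenFeatures(token):
        # step 3: token-level features
        return {
            'length': len(token),
            'is.capitalized': token[:1].isupper(),
            'has.digits': any(char.isdigit() for char in token),
        }

    def tokens2features(tokens):
        # step 3: sequence-level features layered on top of token features
        feature_sequence = [tokenFeatures(token) for token in tokens]
        feature_sequence[0]['first.token'] = True
        feature_sequence[-1]['last.token'] = True
        return feature_sequence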

How to use your new parser

Once you are able to create a model from training data, install your custom parser by running python setup.py develop.

Then, in a Python shell, you can import your parser and use the parse and tag methods to process new strings. For example, to use the probablepeople module:

>>> import probablepeople
>>> probablepeople.parse('Mr George "Gob" Bluth II')
[('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]

Errors and Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report an issue.

Patches and Pull Requests

We welcome your ideas! You can make suggestions in the form of GitHub issues (bug reports, feature requests, general questions), or you can submit a code contribution via a pull request.

How to contribute code:

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request with a description of your work! Don't worry if it isn't perfect: think of a PR as the start of a conversation rather than a finished product.

Copyright and Attribution

Copyright (c) 2016 DataMade. Released under the MIT License.

parserator's People

Contributors

az0, brentpayne, cathydeng, derekeder, fgregg, gitter-badger, jbn, kadrach, mccc, mikemoraned, reginafcompton, vierja


parserator's Issues

Exposing lower-level model evaluation data

Thanks for all the hard work on this! Parserator has definitely made it easy to create a model with crfsuite. As I dig into fine tuning my model, I'd like to have access to the metrics provided by crfsuite (accuracy, precision, recall).

It looks like the python wrapper does provide access to this data (scrapinghub/python-crfsuite#42 (comment)). What do you think of a PR that exposes this as a return value of trainModel?
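
As a minimal sketch of pulling those metrics straight out of python-crfsuite in the meantime, assuming feature/label sequences are already built (the xseq/yseq values here are placeholders):

    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)

    # placeholder sequences; real ones come from the labeled training data
    xseq = [{'token.lower': '123'}, {'token.lower': 'main'}]
    yseq = ['AddressNumber', 'StreetName']

    trainer.append(xseq, yseq, group=0)  # group 0: used for training
    trainer.append(xseq, yseq, group=1)  # group 1: held out for evaluation

    trainer.train('model.crfsuite', holdout=1)

    # metrics crfsuite logged for the final iteration, including per-label
    # precision/recall when a holdout group is supplied
    print(trainer.logparser.last_iteration)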

ValueError when training

Hello,

I'm trying to create a module similar to usaddress, but for another country, so I had to set up new labels. I made a labeled XML file with those labels. However, when I run the parserator train command, I get this error:

Traceback (most recent call last):
  File "/home/n1a/.local/bin/parserator", line 11, in <module>
    sys.exit(dispatch())
  File "/home/n1a/.local/lib/python2.7/site-packages/parserator/main.py", line 37, in dispatch
    args.func(args)
  File "/home/n1a/.local/lib/python2.7/site-packages/parserator/main.py", line 66, in train
    training.train(module, train_file_list)
  File "/home/n1a/.local/lib/python2.7/site-packages/parserator/training.py", line 92, in train
    trainModel(training_data, module)
  File "/home/n1a/.local/lib/python2.7/site-packages/parserator/training.py", line 31, in trainModel
    trainer.append(xseq, yseq)
  File "pycrfsuite/_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
ValueError: The numbers of items and labels differ: |x| = 1, |y| = 13

I am using python 2.7
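
For reference, the error means one appended training pair had 1 feature dict but 13 labels; pycrfsuite requires both sequences to be the same length, so a likely cause is the module's tokenize() splitting a string differently than the labeled XML does. A minimal reproduction (the label name is illustrative):

    import pycrfsuite

    xseq = [{'token': '123'}]   # 1 item of features
    yseq = ['SomeLabel'] * 13   # 13 labels

    trainer = pycrfsuite.Trainer()
    trainer.append(xseq, yseq)  # ValueError: |x| = 1, |y| = 13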

guidelines for reporting parsing issues for usaddress/probablepeople

It would be nice to add guidelines (in the web interface, readmes, and documentation) so that we are not having the same conversations over and over on GitHub. Maybe a link to an issue template?

  • real strings from real data
  • include how tokens were labeled & how they should be labeled
  • include a few (<5) similar examples to train on, if available

Use subcommands for parserators

Let's use sub-commands for parserators: https://docs.python.org/2/library/argparse.html#sub-commands

Something like

> parserator
usage: parserator {label, train} ...

positional arguments:
  {label, train}
> parserator label
usage: parserator label infile outfile modulename

parserator label: error: the following arguments are required: infile, outfile, modulename
> parserator train
usage: parserator train training_file modulename
parserator train: error: the following arguments are required: training_file, modulename

parserator and duckling.wit.ai

Hi,

I just found this repository while looking for a probabilistic parser in Python. It is great, congratulations! I have been using wit.ai's duckling (duckling.wit.ai) for a while, and I wonder how simple it would be to parse numbers and dates with parserator.

Thanks

Parserator train command should take a glob

I am training parserator on lots of XML files (one XML file for each DocumentCloud ID). It would be nice if parserator could train on a glob of XML files (at least I would want this feature).

So like:

parserator train *.xml contract_parser

Thoughts?
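
As step 5 of the quick overview above shows, the train command does accept a glob if it is quoted, so the shell passes the pattern through to parserator instead of expanding it:

    parserator train "labeled_xml/*.xml" contract_parser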

ModuleNotFoundError when running parserator

How can I create an initial model file for training? I run this command and it returns this error:
"ModuleNotFoundError: No module named 'mymodel'"

parserator label unlabeled/rawstrings.csv labeled_xml/labeled.xml mymodel

Native support for parallelization

Parserator starts lagging when training on large documents -- and when parsing large documents. It would be nice if Parserator used some of the Python parallelization modules to take advantage of multiple cores, when available. Right now, I'm working around this (somewhat) using the parallel command. But it would be great if you could just pass parserator a -P flag and it would parallelize for you.

parallel python parse_documents.py ::: $(find path/to/corpus/*_text.txt -type f)
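
Until something like a -P flag exists, here is a hedged stopgap sketch using only the standard library; usaddress stands in for any trained parserator module, and parse_file is a hypothetical helper:

    from multiprocessing import Pool
    import glob

    import usaddress  # any trained parserator module works the same way

    def parse_file(path):
        # read one document and run the trained parser over it
        with open(path) as f:
            return path, usaddress.parse(f.read())

    if __name__ == '__main__':
        paths = glob.glob('path/to/corpus/*_text.txt')
        with Pool() as pool:  # defaults to one worker per core
            results = pool.map(parse_file, paths)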

Fix simple typo: represention -> representation

Issue Type

[x] Bug (Typo)

Steps to Replicate

  1. Examine docs/index.rst.
  2. Search for represention.

Expected Behaviour

  1. Should read representation.

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources, a branch with the fix has been prepared, but a pull request has not yet been created. A pull request fixing the issue can be prepared from the link below; feel free to create it, or request that @timgates42 create the PR. Alternatively, if the fix is undesired, please close the issue with a short comment explaining the reasoning.

https://github.com/timgates42/parserator/pull/new/bugfix_typo_representation

Thanks.

Training usaddress and probablepeople on the same dataset

Is there a way to train usaddress and probablepeople in a single parserator session, to avoid two separate sessions? It is time-consuming to label each text string twice: one session training on usaddress and another on probablepeople. Here is some text to consider:

TRUST CA073108132 EXHIBIT "A" IDENTIFICATION OF PROPERTY OWNERS AND PROPERTY DESCRIPTION Record Owner(s) Names: FOLEY JONES FAMILY TRUST; Trustee(s): Patricia Jones, Henry Foley Address: 12349 Oliva Rd, San Diego, CA 92128

usaddress's successfully trained .crfsuite file is not being used and cannot be found in the directories

Labeling and training of parserator for usaddress was successful. Once trained, the resulting .crfsuite file is supposed to be saved at ...workingdirectory/usaddress/usaddress.crfsuite

But no such file is found; in fact, no such folder exists in that directory. I also tried parsing after training, in case the file had been saved in some other location I couldn't see. But parsing after training uses the default model, not the trained one (I can tell because I am testing on the same strings I trained on and getting different results).

https://github.com/datamade/parserator/issues/3
I have looked into this issue, but in my case I can neither find the file nor get results that reflect the training.

Kindly advise on how to proceed.
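
One quick check (a sketch, not official guidance): the installed module loads its model relative to its own package directory, and the MODEL_FILE name comes from parserator's generated template, so printing both shows where the trained file is expected to live:

    import os
    import usaddress

    # directory the installed package actually imports from
    print(os.path.dirname(usaddress.__file__))
    # model filename the package expects, e.g. 'usaddr.crfsuite'
    print(usaddress.MODEL_FILE)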

UTF-8 compatibility

Add

# -*- coding: utf-8 -*-

to the header of every parserator .py file and its descendant parsers.

Ran into this problem with @AbeHandler

Labeller incompatible with Python 3

  • The label function in manual_labeling expects its input to be encoded bytes, but Python 3's open defaults to Unicode text.
  • data_prep_util's appendListToXMLfile doesn't specify a write mode (thus defaulting to text mode), while etree.tostring produces a bytes string.
  • etree.tostring will have XML-escaped Unicode characters, so decoding its output in test_manual_labeling accomplishes nothing.
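
A minimal sketch of the fix for the second bullet, assuming a function shaped like the one the issue names (the exact signature in data_prep_util is an assumption):

    from lxml import etree

    def append_list_to_xml_file(element, filepath):
        # etree.tostring returns bytes, so open the file in binary append
        # mode instead of the default text mode
        with open(filepath, 'ab') as f:
            f.write(etree.tostring(element))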

TypeError: 'write() argument 1 must be unicode, not str'

python 2.7

:~/parser$ parserator init test_123

Initializing directories for test_123 <open file '', mode 'w' at 0x7f98203211e0>

  • test_123 <open file '', mode 'w' at 0x7f98203211e0>
  • raw <open file '', mode 'w' at 0x7f98203211e0>
  • training <open file '', mode 'w' at 0x7f98203211e0>
  • tests <open file '', mode 'w' at 0x7f98203211e0>

Generating __init__.py <open file '', mode 'w' at 0x7f98203211e0>
Traceback (most recent call last):
  File "/home/joerib/.local/bin/parserator", line 8, in <module>
    sys.exit(dispatch())
  File "/home/joerib/.local/lib/python2.7/site-packages/parserator/main.py", line 70, in dispatch
    args.func(args)
  File "/home/joerib/.local/lib/python2.7/site-packages/parserator/main.py", line 118, in init
    f.write(parser_template.init_template())
TypeError: write() argument 1 must be unicode, not str
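
For context, a minimal sketch of that failure mode: on Python 2, io.open returns a file object that accepts only unicode, while the template string is bytes; decoding before writing is one possible fix (this is an illustration, not the patch parserator shipped):

    import io

    template = 'generated module text'  # str is bytes on Python 2, text on 3

    # io.open file objects only accept unicode, so decode byte strings first
    if isinstance(template, bytes):
        template = template.decode('utf-8')

    with io.open('example.py', 'w', encoding='utf-8') as f:
        f.write(template)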

"Parserator init" The specified procedure could not be found

Using Visual Studio 2017
Using Windows 10 64 bit

So I've run:

pip install parserator
Requirement already satisfied: parserator in c:\program files\python36\lib\site-packages
Requirement already satisfied: future>=0.14.3 in c:\program files\python36\lib\site-packages (from parserator)
Requirement already satisfied: python-crfsuite>=0.7 in c:\program files\python36\lib\site-packages (from parserator)
Requirement already satisfied: lxml>=3.7.3 in c:\program files\python36\lib\site-packages (from parserator)
Requirement already satisfied: backports.csv in c:\program files\python36\lib\site-packages (from parserator)

Then I've run:

c:\users\kristof\documents\visual studio 2017\Projects\PythonApplication1\PythonApplication1>parserator init bankstatement
Traceback (most recent call last):
  File "c:\program files\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\program files\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Program Files\Python36\Scripts\parserator.exe\__main__.py", line 5, in <module>
  File "c:\program files\python36\lib\site-packages\parserator\main.py", line 14, in <module>
    from . import manual_labeling
  File "c:\program files\python36\lib\site-packages\parserator\manual_labeling.py", line 9, in <module>
    from lxml import etree
ImportError: DLL load failed: The specified procedure could not be found.

What am I doing wrong when running init?

Adapting parserator to handle an entire document

I am currently using Parserator to parse short strings, like this:

s of 1110 of an hour. The maximum amount to be paid under this contract is $20,000.00. No amount of work is guaranteed under this agreement; payments wil

and this

General Liability insurance will be purchased and maintained with limits of $1,000,000 per occurrence an

I extract these strings using a loose regular expression ".{75}$[0-9]+.{75}" on documents that are usually 5 to 10 pages long. I'm most interested in tagging and categorizing the dollar values. Often, the 100 characters around the string is enough to categorize the dollar value. But in some cases I need input from other parts of the document to do the tagging (ex. earlier in the document it might mention that the document is a lease).

@fgregg has pointed me to http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb to show how you could do this with crfsuite, but I am wondering if it might be possible with the parserator wrapper. All uninteresting tokens would be tagged with a null label, and the interesting ones would be tagged with their proper values.

I wanted to see what you all thought about (1) using parserator in this way and (2) adapting parserator to cover such cases. The biggest holdup to using parserator in this way is tagging documents with hundreds and hundreds of tokens. It seems like you would want a small document-annotation GUI to generate the XML to train parserator. Do you think that such a GUI should be part of the library? Do you think this would work? Would you be open to a pull request?

Can't Install

Hi there. I was hoping someone could help me with this issue. I'm not able to install parserator. I know this isn't necessarily a parserator issue, but it would be really helpful to get some assistance with this.

My intention is to build an address parser for South Africa.

This the error that comes back.

C:\WINDOWS\system32>"C:\Python27\ArcGIS10.4\python.exe" -m pip install parserator
Collecting parserator
  Using cached parserator-0.6.2.tar.gz
Requirement already satisfied (use --upgrade to upgrade): future>=0.14.3 in c:\python27\arcgis10.4\lib\site-packages (from parserator)
Collecting lxml>=3.4.1 (from parserator)
  Using cached lxml-3.6.4.tar.gz
Requirement already satisfied (use --upgrade to upgrade): python-crfsuite>=0.7 in c:\python27\arcgis10.4\lib\site-packages (from parserator)
Collecting backports.csv (from parserator)
  Using cached backports.csv-1.0.2-py2.py3-none-any.whl
Installing collected packages: lxml, backports.csv, parserator
  Running setup.py install for lxml ... error
    Complete output from command C:\Python27\ArcGIS10.4\python.exe -u -c "import setuptools, tokenize;__file__='c:\\users\\pdossa~1\\appdata\\local\\temp\\pip-build-vvxc68\\lxml\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\pdossa~1\appdata\local\temp\pip-g_btwd-record\install-record.txt --single-version-externally-managed --compile:
    Building lxml version 3.6.4.
    Building without Cython.
    ERROR: 'xslt-config' is not recognized as an internal or external command,
    operable program or batch file.

    ** make sure the development packages of libxml2 and libxslt are installed **

    Using build configuration of libxslt
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-2.7
    creating build\lib.win32-2.7\lxml
    copying src\lxml\builder.py -> build\lib.win32-2.7\lxml
    copying src\lxml\cssselect.py -> build\lib.win32-2.7\lxml
    copying src\lxml\doctestcompare.py -> build\lib.win32-2.7\lxml
    copying src\lxml\ElementInclude.py -> build\lib.win32-2.7\lxml
    copying src\lxml\pyclasslookup.py -> build\lib.win32-2.7\lxml
    copying src\lxml\sax.py -> build\lib.win32-2.7\lxml
    copying src\lxml\usedoctest.py -> build\lib.win32-2.7\lxml
    copying src\lxml\_elementpath.py -> build\lib.win32-2.7\lxml
    copying src\lxml\__init__.py -> build\lib.win32-2.7\lxml
    creating build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\__init__.py -> build\lib.win32-2.7\lxml\includes
    creating build\lib.win32-2.7\lxml\html
    copying src\lxml\html\builder.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\clean.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\defs.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\diff.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\ElementSoup.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\formfill.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\html5parser.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\soupparser.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\usedoctest.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\_diffcommand.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\_html5builder.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\_setmixin.py -> build\lib.win32-2.7\lxml\html
    copying src\lxml\html\__init__.py -> build\lib.win32-2.7\lxml\html
    creating build\lib.win32-2.7\lxml\isoschematron
    copying src\lxml\isoschematron\__init__.py -> build\lib.win32-2.7\lxml\isoschematron
    copying src\lxml\lxml.etree.h -> build\lib.win32-2.7\lxml
    copying src\lxml\lxml.etree_api.h -> build\lib.win32-2.7\lxml
    copying src\lxml\includes\c14n.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\config.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\dtdvalid.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\etreepublic.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\htmlparser.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\relaxng.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\schematron.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\tree.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\uri.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\xinclude.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\xmlerror.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\xmlparser.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\xmlschema.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\xpath.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\xslt.pxd -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\etree_defs.h -> build\lib.win32-2.7\lxml\includes
    copying src\lxml\includes\lxml-version.h -> build\lib.win32-2.7\lxml\includes
    creating build\lib.win32-2.7\lxml\isoschematron\resources
    creating build\lib.win32-2.7\lxml\isoschematron\resources\rng
    copying src\lxml\isoschematron\resources\rng\iso-schematron.rng -> build\lib.win32-2.7\lxml\isoschematron\resources\rng
    creating build\lib.win32-2.7\lxml\isoschematron\resources\xsl
    copying src\lxml\isoschematron\resources\xsl\RNG2Schtrn.xsl -> build\lib.win32-2.7\lxml\isoschematron\resources\xsl
    copying src\lxml\isoschematron\resources\xsl\XSD2Schtrn.xsl -> build\lib.win32-2.7\lxml\isoschematron\resources\xsl
    creating build\lib.win32-2.7\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_abstract_expand.xsl -> build\lib.win32-2.7\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_dsdl_include.xsl -> build\lib.win32-2.7\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_message.xsl -> build\lib.win32-2.7\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_schematron_skeleton_for_xslt1.xsl -> build\lib.win32-2.7\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\iso_svrl_for_xslt1.xsl -> build\lib.win32-2.7\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    copying src\lxml\isoschematron\resources\xsl\iso-schematron-xslt1\readme.txt -> build\lib.win32-2.7\lxml\isoschematron\resources\xsl\iso-schematron-xslt1
    running build_ext
    building 'lxml.etree' extension
    creating build\temp.win32-2.7
    creating build\temp.win32-2.7\Release
    creating build\temp.win32-2.7\Release\src
    creating build\temp.win32-2.7\Release\src\lxml
    C:\Users\pdossantos\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -Isrc\lxml\includes -IC:\Python27\ArcGIS10.4\include -IC:\Python27\ArcGIS10.4\PC /Tcsrc\lxml\lxml.etree.c /Fobuild\temp.win32-2.7\Release\src\lxml\lxml.etree.obj -w
    cl : Command line warning D9025 : overriding '/W3' with '/w'
    lxml.etree.c
    src\lxml\includes\etree_defs.h(14) : fatal error C1083: Cannot open include file: 'libxml/xmlversion.h': No such file or directory
    Compile failed: command 'C:\\Users\\pdossantos\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\cl.exe' failed with exit status 2
    creating users
    creating users\pdossa~1
    creating users\pdossa~1\appdata
    creating users\pdossa~1\appdata\local
    creating users\pdossa~1\appdata\local\temp
    C:\Users\pdossantos\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I/usr/include/libxml2 /Tcc:\users\pdossa~1\appdata\local\temp\xmlXPathInitxf8h0v.c /Fousers\pdossa~1\appdata\local\temp\xmlXPathInitxf8h0v.obj
    xmlXPathInitxf8h0v.c
    c:\users\pdossa~1\appdata\local\temp\xmlXPathInitxf8h0v.c(1) : fatal error C1083: Cannot open include file: 'libxml/xpath.h': No such file or directory
    *********************************************************************************
    Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
    *********************************************************************************
    error: command 'C:\\Users\\pdossantos\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\cl.exe' failed with exit status 2

    ----------------------------------------
Command "C:\Python27\ArcGIS10.4\python.exe -u -c "import setuptools, tokenize;__file__='c:\\users\\pdossa~1\\appdata\\local\\temp\\pip-build-vvxc68\\lxml\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\pdossa~1\appdata\local\temp\pip-g_btwd-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\pdossa~1\appdata\local\temp\pip-build-vvxc68\lxml\

Merge pull request from unicode rewrote training.py and borked a number of downstream projects

I'm not exactly sure what happened when #45 "unicode" was merged in, but it appears to have rewritten a lot of things. Notably, training.py, which now lacks the readTrainingData method. This breaks a number of packages which depend on the method, including the training and testing environments for usaddress and probablepeople.

It's possible the problem is limited to this one method, but it might be advisable for the maintainer to double-check the merge.

Error while training

In a fresh virtualenv (here with Python 3, but the bug also occurs with Python 2.7, on Ubuntu 14.04).
It crashed with the same error when I tried a project of my own. So I checked whether usaddress works on my machine with:

git clone https://github.com/datamade/usaddress.git
cd usaddress
pip install -r requirements.txt
python setup.py develop
parserator train training/labeled.xml usaddress

Here is the output and stack trace:

~/parsertest3/usaddress$ parserator train training/labeled.xml usaddress
/home/darkblue/parsertest3/usaddress/usaddress/__init__.py:160: UserWarning: You must train the model (parserator train --trainfile FILES) to create the usaddr.crfsuite file before you can use the parse and tag methods
  warnings.warn('You must train the model (parserator train --trainfile FILES) to create the %s file before you can use the parse and tag methods' %MODEL_FILE)

training model on 1046 training examples from ['training/labeled.xml']
Traceback (most recent call last):
  File "/home/darkblue/parsertest3/bin/parserator", line 11, in <module>
    sys.exit(dispatch())
  File "/home/darkblue/parsertest3/lib/python3.4/site-packages/parserator/main.py", line 54, in dispatch
    args.func(args)
  File "/home/darkblue/parsertest3/lib/python3.4/site-packages/parserator/main.py", line 81, in train
    training.train(module, train_file_list, modelfile)
  File "/home/darkblue/parsertest3/lib/python3.4/site-packages/parserator/training.py", line 96, in train
    trainModel(training_data, module, model_file)
  File "/home/darkblue/parsertest3/lib/python3.4/site-packages/parserator/training.py", line 33, in trainModel
    trainer.train(model_path)
  File "pycrfsuite/_pycrfsuite.pyx", line 359, in pycrfsuite._pycrfsuite.BaseTrainer.train (pycrfsuite/_pycrfsuite.cpp:3857)
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string (pycrfsuite/_pycrfsuite.cpp:9777)
TypeError: expected bytes, NoneType found

Can you advise?
Thank you

Need to go back when labeling data

I am labeling data now with parserator. Sometimes I hit the wrong key and want to undo. I don't see a way to do that with the labeler. It would be great to add this feature or make the command-line option more obvious.

    What is 'of'? 5
    What is 'notless'? 5
    What is 'than'? 5   # should have pressed 7... not 5
    What is '$100,000.00'?

Issue when running init for a new parser

I followed the documentation and got the following error. Could you help?

I have ensured my Windows terminal is already running in UTF-8:
D:\playground\testParser>python
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

D:\playground\testParser>parserator init testParser
Initializing directories for testParser <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>

  • testParser <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>
  • raw <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>
  • training <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>
  • tests <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>

Generating __init__.py <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>
Traceback (most recent call last):
  File "d:\andre\python\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "d:\andre\python\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\aNDrE\Python\Python36\Scripts\parserator.exe\__main__.py", line 9, in <module>
  File "d:\andre\python\python36\lib\site-packages\parserator\main.py", line 70, in dispatch
    args.func(args)
  File "d:\andre\python\python36\lib\site-packages\parserator\main.py", line 118, in init
    f.write(parser_template.init_template())
  File "d:\andre\python\python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u3145' in position 252: character maps to <undefined>
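
The last frames show the generated file being written with Windows' default cp1252 codec, which cannot represent '\u3145'. On Python 3.7 or newer, one possible workaround (an assumption, not a documented parserator fix) is enabling Python's UTF-8 mode before rerunning the command, which also switches the default encoding for newly opened files:

    set PYTHONUTF8=1
    parserator init testParser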

UI suggestion

When labeling training data, it would be helpful if the console reprinted the tag numbers for every new item that needs to be labeled. I often forget the meaning of the numbers when I am 10 or 20 items deep into the labeling...

So something like this could be reprinted each time:

Start console labeling!

These are the tags available for labeling:
0 : amendment_amount
1 : amendment_amount_description
2 : agreement_amount
3 : agreement_amount_description
4 : other_amount
5 : other_amount_description
6 : document_self_reference
7 : amount_alphabetic
8 : skip
