webstruct's Introduction

Webstruct

Webstruct is a library for creating statistical NER systems that work on HTML data, i.e. a library for building tools that extract named entities (addresses, organization names, open hours, etc.) from webpages.

Unlike most NER systems, webstruct works on HTML data, not only on text. This makes it possible to define features that use HTML structure, and also to embed annotation results back into HTML.
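
For example, a token feature function can combine the token text with its HTML context (a minimal sketch; the HtmlToken attributes .token and .parent match the ones used in the issues below):

def token_and_parent(html_token):
    # features can use HTML structure, not just the token text
    return {
        'token': html_token.token,            # the raw token text
        'parent_tag': html_token.parent.tag,  # tag of the enclosing element
    }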

Read the docs for more info.

License is MIT.

Contributing

To run tests, make sure tox is installed, then run tox from the source root.

webstruct's People

Contributors

carlosp420, chekunkov, kebniss, kmike, shaneaevans, sibiryakov, suor, tpeng, whalebot-helmsman


webstruct's Issues

One good example beginning to end

Your tool looks like what I'm looking for, but the documentation is so limited that I can't use it. Just one screencast or example would do the trick.
All I want to know is how to train something to use with NER. You suggest using WebAnnotator, and you provide code to load trees out of the files saved from WebAnnotator, but you stop there. Why not follow through with a complete example that shows how to extract content based on that model?
Thanks,
-jim

Tutorial results in obscure errors.

I've been following the webstruct tutorial and I'm getting a few peculiar errors.
From the tutorial I end up with code along the lines of this:

from itertools import islice
import pkg_resources
import webstruct


def token_identity(html_token):
    return {'token': html_token.token}


def token_isupper(html_token):
    return {'isupper': html_token.token.isupper()}


def parent_tag(html_token):
    return {'parent_tag': html_token.parent.tag}


def border_at_left(html_token):
    return {'border_at_left': html_token.index == 0}


DATA_DIR = pkg_resources.resource_filename('project', 'data/business_annotated')


def get_training():
    trees = webstruct.load_trees("{}/*.html".format(DATA_DIR), webstruct.WebAnnotatorLoader())
    trees = islice(trees, 0, 10)  # todo
    return trees


def tokenize_training(trees):
    html_tokenizer = webstruct.HtmlTokenizer()
    tokens, labels = html_tokenizer.tokenize(trees)
    return tokens, labels


def main():
    print('creating model...')
    model = webstruct.create_wapiti_pipeline(
        'company.wapiti',
        token_features=[token_identity, token_isupper, parent_tag, border_at_left],
        train_args='--algo l-bfgs --maxiter 50 --compact',
    )
    print('getting training data...')
    tokens, labels = tokenize_training(get_training())
    print('fitting training data...')
    model.fit(tokens, labels)
    print('starting extract...')
    ner = webstruct.NER(model)
    print(ner.extract_from_url('http://scrapinghub.com/contact'))

if __name__ == '__main__':
    main()

The first error I get is a TypeError when trying to extract something with ner:

Traceback (most recent call last):
  File "/home/dex/projects/project/project/spiders/test.py", line 54, in <module>
    main()
  File "/home/dex/projects/project/project/spiders/test.py", line 51, in main
    print(ner.extract_from_url('http://scrapinghub.com/contact'))
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 58, in extract_from_url
    return self.extract(data)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 46, in extract
    groups = IobEncoder.group(zip(html_tokens, tags))
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/sequence_encoding.py", line 128, in group
    return list(cls.iter_group(data, strict))
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/sequence_encoding.py", line 136, in iter_group
    if iob_tag.startswith('I-') and tag != iob_tag[2:]:
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

It seems like a Python 3 support issue: the tag is bytes, so startswith expects a bytes prefix but gets a str?
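
The mismatch is easy to reproduce in isolation (a minimal illustration, not webstruct code):

# under Python 3 the wapiti wrapper apparently returns tags as bytes,
# while IobEncoder compares them against str prefixes like 'I-'
tag = b'I-ORG'
try:
    tag.startswith('I-')   # mixing bytes and str raises the TypeError above
except TypeError as exc:
    print(exc)             # startswith first arg must be bytes or a tuple of bytes, not str
print(tag.decode('utf-8').startswith('I-'))  # True once the tag is decoded to str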

The second error happens when trying to build a NER straight from the model without fitting it first:

def main():
    print('creating model...')
    model = webstruct.create_wapiti_pipeline(
        'company.wapiti',
        token_features=[token_identity, token_isupper, parent_tag, border_at_left],
        train_args='--algo l-bfgs --maxiter 50 --compact',
    )
    # print('getting training data...')
    # tokens, labels = tokenize_training(get_training())
    # print('fitting training data...')
    # model.fit(tokens, labels)
    # print('starting extract...')
    ner = webstruct.NER(model)
    print(ner.extract_from_url('http://scrapinghub.com/contact'))

Results in:

Traceback (most recent call last):
  File "/home/dex/projects/project/project/spiders/test.py", line 53, in <module>
    main()
  File "/home/dex/projects/project/project/spiders/test.py", line 50, in main
    print(ner.extract_from_url('http://scrapinghub.com/contact'))
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 58, in extract_from_url
    return self.extract(data)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 45, in extract
    html_tokens, tags = self.extract_raw(bytes_data)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 67, in extract_raw
    tags = self.model.predict([html_tokens])[0]
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/sklearn/utils/metaestimators.py", line 54, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/sklearn/pipeline.py", line 327, in predict
    return self.steps[-1][-1].predict(Xt)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 211, in predict
    sequences = self._to_wapiti_sequences(X)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 230, in _to_wapiti_sequences
    X = self.feature_encoder.transform(X)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 313, in transform
    return [self.transform_single(feature_dicts) for feature_dicts in X]
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 313, in <listcomp>
    return [self.transform_single(feature_dicts) for feature_dicts in X]
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 308, in transform_single
    line = ' '.join(_tostr(dct.get(key)) for key in self.feature_names_)
TypeError: 'NoneType' object is not iterable

The errors seem very vague and I don't even know where to start debugging this. Am I missing something?
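
A plausible reading of the second traceback (a sketch, not webstruct's actual code): sklearn-style encoders only populate their trailing-underscore attributes in fit(), so calling predict() on an unfitted pipeline iterates over a feature_names_ that is still None.

class FeatureEncoderSketch(object):
    """Rough stand-in for the feature encoder in webstruct/wapiti.py."""

    def __init__(self):
        self.feature_names_ = None   # only populated by fit()

    def fit(self, X, y=None):
        # X is a list of token sequences; each token is a feature dict
        self.feature_names_ = sorted({key for seq in X for dct in seq for key in dct})
        return self

    def transform_single(self, dct):
        # on an unfitted encoder feature_names_ is still None, so this
        # raises "TypeError: 'NoneType' object is not iterable"
        return ' '.join(str(dct.get(key)) for key in self.feature_names_)

If that reading is right, NER can only wrap a pipeline that has already been fitted (or restored from a pickle of a fitted pipeline); the company.wapiti file on disk alone would not restore the feature encoder's state.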

I'm running:
webstruct - 0.5
scikit-learn - 0.18.2
scipy - 0.19
libwapiti - 0.2.1

`load_trees()` raises exception

>>> import webstruct
>>> list(webstruct.load_trees('data/*.html', webstruct.HtmlLoader()))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-4694a0768f78> in <module>()
----> 1 list(webstruct.load_trees('data/*.html', webstruct.HtmlLoader()))

/home/suor/projects/.virtualenvs/advices/local/lib/python2.7/site-packages/webstruct/loaders.pyc in <genexpr>(***failed resolving arguments***)
    160     """
    161     return chain.from_iterable(
--> 162         load_trees_from_files(pat, loader, verbose) for pat, loader in patterns
    163     )
    164 

ValueError: need more than 1 value to unpack

This is with webstruct 0.2.
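
Judging from the generator expression in the traceback (for pat, loader in patterns), load_trees in 0.2 appears to iterate its first argument as (pattern, loader) pairs, so a bare string pattern fails to unpack. If so, passing the pairs explicitly might sidestep the error (an unverified assumption based only on the traceback):

import webstruct

# hypothetical workaround for webstruct 0.2: pass explicit
# (pattern, loader) pairs instead of a single pattern and loader
trees = list(webstruct.load_trees(
    [('data/*.html', webstruct.HtmlLoader())]
))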

Not possible to annotate <button>, <option> etc. elements

Using the WebAnnotator Firefox extension it is not possible to annotate text that is a descendant of certain (interactive) HTML elements. I have noticed <button> and <option> so far, but there may be others.

This can lead to (apparent) false positives when predictions are made on text belonging to these elements, which will affect the model and the resulting metrics.

I firstly wanted to confirm that it is indeed impossible to add annotations to these elements, and if so, I have two questions:

  1. What is the cleanest way to remove these tags?
  2. Do you think this should become a webstruct default?

Within the HtmlTokenizer constructor we have a few options available but none seem suitable for this task.

  • ignore_html_tags will ignore the element and its children, but will also remove any tail text; e.g. in <html><body>start<option>hello</option>end</body></html> the text "end" would be lost.

  • kill_html_tags will drop the element and its children and preserve tail text, but this requires keep_child = False, and that parameter is not exposed at the class level (see the lxml sketch after this list).

I would be happy to create a PR/tests for this if required.
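
One possible workaround outside of HtmlTokenizer, using plain lxml (a sketch; lxml.etree.strip_elements with with_tail=False removes the elements and their children but keeps the tail text):

import lxml.html
import lxml.etree

html = '<html><body>start<option>hello</option>end</body></html>'
tree = lxml.html.fromstring(html)

# drop <button>/<option> subtrees but keep the tail text ("end")
lxml.etree.strip_elements(tree, 'button', 'option', with_tail=False)

print(lxml.html.tostring(tree))  # b'<html><body>startend</body></html>'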

Python 3 support

Do you intend to add it? If so, I can help update your code to be compatible with both Python 2 and 3.

Get rid of scikit-learn dependencies

Without scikit-learn and seqlearn it would be possible to install webstruct without the numpy/scipy stack and without Cython. They are used for auxiliary things: nicer __repr__s, Pipeline instead of hand-written classes (a rough sketch of that subset is below), and metrics (which are broken anyway, see #14) - nothing serious or hard to replace.
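
For instance, the Pipeline part could plausibly be replaced with a few lines (a rough sketch of the subset of sklearn.pipeline.Pipeline behaviour a webstruct model needs, not a drop-in replacement):

class SimplePipeline(object):
    """Fit each transformer in order, then the final estimator."""

    def __init__(self, steps):
        self.steps = steps   # list of (name, estimator) pairs

    def fit(self, X, y=None):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)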

Pretrained Models

Webstruct looks like a really cool extension to have for any scraping enthusiast, so thank you for creating this!
It would be really awesome if you guys could also release some pre-trained models along with this library. It's not feasible for every user to have loads of annotated data, and what people are generally looking for are the most common entities (NAME, PLACE, ORGANISATION, etc.). A humble suggestion 😄

to_webannotator may fail if an attribute value of some HTML element contains a control character

Traceback (after trying NER.annotate() on the https://github.com/scrapinghub/webstruct/blob/master/webstruct_data/corpus/business_pages/source/301.html page):

ValueError                                Traceback (most recent call last)
<ipython-input-8-45ad24ffcda1> in <module>()
      9     try:
     10         with open(fn, 'rb') as f:
---> 11             annotated = ner.annotate(f.read())
     12 
     13         path, filename = os.path.split(fn)

/Users/kmike/svn/webstruct/webstruct/model.pyc in annotate(self, bytes_data, pretty_print)
    105         html_tokens, tags = self.extract_raw(bytes_data)
    106         tree = self.html_tokenizer.detokenize_single(html_tokens, tags)
--> 107         tree = to_webannotator(tree, self.entity_colors)
    108         return tostring(tree, pretty_print=pretty_print)
    109 

/Users/kmike/svn/webstruct/webstruct/webannotator.py in to_webannotator(tree, entity_colors)
    258     """
    259     handler = _WaContentHandler(entity_colors)
--> 260     lxml.sax.saxify(tree, handler)
    261     tree = handler.out.etree
    262     _copy_title(tree)

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in saxify(element_or_tree, content_handler)
    245     them against a SAX ContentHandler.
    246     """
--> 247     return ElementTreeProducer(element_or_tree, content_handler).saxify()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in saxify(self)
    178                 self._recursive_saxify(sibling, {})
    179 
--> 180         self._recursive_saxify(element, {})
    181 
    182         if hasattr(element, 'getnext'):

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in _recursive_saxify(self, element, prefixes)
    224             content_handler.characters(element.text)
    225         for child in element:
--> 226             self._recursive_saxify(child, prefixes)
    227         content_handler.endElementNS((ns_uri, local_name), qname)
    228         for prefix, uri in new_prefixes:

[... the same _recursive_saxify frame repeated once per level of HTML nesting ...]

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in _recursive_saxify(self, element, prefixes)
    220             content_handler.startPrefixMapping(prefix, uri)
    221         content_handler.startElementNS((ns_uri, local_name),
--> 222                                        qname, sax_attributes)
    223         if element.text:
    224             content_handler.characters(element.text)

/Users/kmike/svn/webstruct/webstruct/webannotator.py in startElementNS(self, name, qname, attributes)
    122         self._closeSpan()
    123         # print('start %s' % qname)
--> 124         self.out.startElementNS(name, qname, attributes)
    125         self._openSpan()
    126 

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in startElementNS(self, ns_name, qname, attributes)
    110         else:
    111             element = SubElement(element_stack[-1], el_name,
--> 112                                  attrs, self._new_mappings)
    113         element_stack.append(element)
    114 

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree.SubElement (src/lxml/lxml.etree.c:67070)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:15492)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:15423)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._initNodeAttributes (src/lxml/lxml.etree.c:16529)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._addAttributeToNode (src/lxml/lxml.etree.c:16701)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)()

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
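
A possible pre-processing workaround (a hypothetical helper, not part of webstruct): strip XML-incompatible control characters from attribute values before the tree reaches lxml.sax.saxify.

import re

# characters lxml rejects: everything below 0x20 except \t, \n and \r
_CONTROL_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_control_chars(tree):
    """Sanitize attribute values in place so saxify() stops raising
    "All strings must be XML compatible" (hypothetical helper)."""
    for el in tree.iter():
        for name, value in el.attrib.items():
            el.attrib[name] = _CONTROL_CHARS.sub(u'', value)
    return tree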

don't use conda on travis

Installation of numpy/scipy/scikit-learn should be fast with wheels; there seems to be no more reason to use conda on Travis.
