webstruct's Introduction

Webstruct

Webstruct is a library for creating statistical NER systems that work on HTML data, i.e. a library for building tools that extract named entities (addresses, organization names, open hours, etc.) from webpages.

Unlike most NER systems, webstruct works on HTML data, not only on text. This makes it possible to define features that use HTML structure, and also to embed annotation results back into HTML.
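
For example, a token feature function can combine the token text with its HTML context (a minimal sketch; the HtmlToken attributes .token and .parent match the ones used in the issues below):

def token_and_parent(html_token):
    # features can use HTML structure, not just the token text
    return {
        'token': html_token.token,            # the raw token text
        'parent_tag': html_token.parent.tag,  # tag of the enclosing element
    }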

Read the docs for more info.

License is MIT.

Contributing

To run tests, make sure tox is installed, then run tox from the source root.

webstruct's People

Contributors

carlosp420, chekunkov, kebniss, kmike, shaneaevans, sibiryakov, suor, tpeng, whalebot-helmsman


webstruct's Issues

One good example beginning to end

Your tool looks like what I'm looking for, but the documentation is so limited that I can't use it. Just one screencast or example would do the trick.
All I want to know is how to train something to use with NER. You suggest using WebAnnotator, and you provide code to load trees out of the files saved from WebAnnotator, but you stop there. Why not follow through with a complete example that shows how to extract content based on that model?
Thanks,
-jim

Tutorial results in obscure errors.

I've been following the webstruct tutorial and I'm getting a few peculiar errors.
From the tutorial I end up with code along the lines of this:

from itertools import islice
import pkg_resources
import webstruct


def token_identity(html_token):
    return {'token': html_token.token}


def token_isupper(html_token):
    return {'isupper': html_token.token.isupper()}


def parent_tag(html_token):
    return {'parent_tag': html_token.parent.tag}


def border_at_left(html_token):
    return {'border_at_left': html_token.index == 0}


DATA_DIR = pkg_resources.resource_filename('project', 'data/business_annotated')


def get_training():
    trees = webstruct.load_trees("{}/*.html".format(DATA_DIR), webstruct.WebAnnotatorLoader())
    trees = islice(trees, 0, 10)  # todo
    return trees


def tokenize_training(trees):
    html_tokenizer = webstruct.HtmlTokenizer()
    tokens, labels = html_tokenizer.tokenize(trees)
    return tokens, labels


def main():
    print('creating model...')
    model = webstruct.create_wapiti_pipeline(
        'company.wapiti',
        token_features=[token_identity, token_isupper, parent_tag, border_at_left],
        train_args='--algo l-bfgs --maxiter 50 --compact',
    )
    print('getting training data...')
    tokens, labels = tokenize_training(get_training())
    print('fitting training data...')
    model.fit(tokens, labels)
    print('starting extract...')
    ner = webstruct.NER(model)
    print(ner.extract_from_url('http://scrapinghub.com/contact'))

if __name__ == '__main__':
    main()

The first error I get is a TypeError when trying to extract something with ner:

Traceback (most recent call last):
  File "/home/dex/projects/project/project/spiders/test.py", line 54, in <module>
    main()
  File "/home/dex/projects/project/project/spiders/test.py", line 51, in main
    print(ner.extract_from_url('http://scrapinghub.com/contact'))
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 58, in extract_from_url
    return self.extract(data)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 46, in extract
    groups = IobEncoder.group(zip(html_tokens, tags))
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/sequence_encoding.py", line 128, in group
    return list(cls.iter_group(data, strict))
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/sequence_encoding.py", line 136, in iter_group
    if iob_tag.startswith('I-') and tag != iob_tag[2:]:
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

It seems like a Python 3 support issue: the tag is bytes, so startswith expects a bytes prefix but gets a str?
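
The mismatch is easy to reproduce in isolation (a minimal illustration, not webstruct code):

# under Python 3 the wapiti wrapper apparently returns tags as bytes,
# while IobEncoder compares them against str prefixes like 'I-'
tag = b'I-ORG'
try:
    tag.startswith('I-')   # mixing bytes and str raises the TypeError above
except TypeError as exc:
    print(exc)             # startswith first arg must be bytes or a tuple of bytes, not str
print(tag.decode('utf-8').startswith('I-'))  # True once the tag is decoded to str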

The second error happens when trying to build a NER straight from the model without fitting it first:

def main():
    print('creating model...')
    model = webstruct.create_wapiti_pipeline(
        'company.wapiti',
        token_features=[token_identity, token_isupper, parent_tag, border_at_left],
        train_args='--algo l-bfgs --maxiter 50 --compact',
    )
    # print('getting training data...')
    # tokens, labels = tokenize_training(get_training())
    # print('fitting training data...')
    # model.fit(tokens, labels)
    # print('starting extract...')
    ner = webstruct.NER(model)
    print(ner.extract_from_url('http://scrapinghub.com/contact'))

Results in:

Traceback (most recent call last):
  File "/home/dex/projects/project/project/spiders/test.py", line 53, in <module>
    main()
  File "/home/dex/projects/project/project/spiders/test.py", line 50, in main
    print(ner.extract_from_url('http://scrapinghub.com/contact'))
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 58, in extract_from_url
    return self.extract(data)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 45, in extract
    html_tokens, tags = self.extract_raw(bytes_data)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/model.py", line 67, in extract_raw
    tags = self.model.predict([html_tokens])[0]
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/sklearn/utils/metaestimators.py", line 54, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/sklearn/pipeline.py", line 327, in predict
    return self.steps[-1][-1].predict(Xt)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 211, in predict
    sequences = self._to_wapiti_sequences(X)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 230, in _to_wapiti_sequences
    X = self.feature_encoder.transform(X)
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 313, in transform
    return [self.transform_single(feature_dicts) for feature_dicts in X]
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 313, in <listcomp>
    return [self.transform_single(feature_dicts) for feature_dicts in X]
  File "/home/dex/.virtualenvs/people/lib/python3.6/site-packages/webstruct/wapiti.py", line 308, in transform_single
    line = ' '.join(_tostr(dct.get(key)) for key in self.feature_names_)
TypeError: 'NoneType' object is not iterable

The errors seem very vague and I don't even know where to start debugging this. Am I missing something?
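
A plausible reading of the second traceback (a sketch, not webstruct's actual code): sklearn-style encoders only populate their trailing-underscore attributes in fit(), so calling predict() on an unfitted pipeline iterates over a feature_names_ that is still None.

class FeatureEncoderSketch(object):
    """Rough stand-in for the feature encoder in webstruct/wapiti.py."""

    def __init__(self):
        self.feature_names_ = None   # only populated by fit()

    def fit(self, X, y=None):
        # X is a list of token sequences; each token is a feature dict
        self.feature_names_ = sorted({key for seq in X for dct in seq for key in dct})
        return self

    def transform_single(self, dct):
        # on an unfitted encoder feature_names_ is still None, so this
        # raises "TypeError: 'NoneType' object is not iterable"
        return ' '.join(str(dct.get(key)) for key in self.feature_names_)

If that reading is right, NER can only wrap a pipeline that has already been fitted (or restored from a pickle of a fitted pipeline); the company.wapiti file on disk alone would not restore the feature encoder's state.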

I'm running:
webstruct - 0.5
scikit-learn - 0.18.2
scipy - 0.19
libwapiti - 0.2.1

`load_trees()` raises exception

>>> import webstruct
>>> list(webstruct.load_trees('data/*.html', webstruct.HtmlLoader()))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-4694a0768f78> in <module>()
----> 1 list(webstruct.load_trees('data/*.html', webstruct.HtmlLoader()))

/home/suor/projects/.virtualenvs/advices/local/lib/python2.7/site-packages/webstruct/loaders.pyc in <genexpr>(***failed resolving arguments***)
    160     """
    161     return chain.from_iterable(
--> 162         load_trees_from_files(pat, loader, verbose) for pat, loader in patterns
    163     )
    164 

ValueError: need more than 1 value to unpack

This is with webstruct 0.2.
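
Judging from the generator expression in the traceback (for pat, loader in patterns), load_trees in 0.2 appears to iterate its first argument as (pattern, loader) pairs, so a bare string pattern fails to unpack. If so, passing the pairs explicitly might sidestep the error (an unverified assumption based only on the traceback):

import webstruct

# hypothetical workaround for webstruct 0.2: pass explicit
# (pattern, loader) pairs instead of a single pattern and loader
trees = list(webstruct.load_trees(
    [('data/*.html', webstruct.HtmlLoader())]
))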

Not possible to annotate <button>, <option> etc. elements

Using the WebAnnotator Firefox extension it is not possible to annotate text that is a descendant of certain (interactive) HTML elements. I have noticed <button> and <option> so far, but there may be others.

This can lead to (apparent) false positives when predictions are made on text belonging to these elements, which will affect the model and the resulting metrics.

I firstly wanted to confirm that it is indeed impossible to add annotations to these elements, and if so, I have two questions:

  1. What is the cleanest way to remove these tags?
  2. Do you think this should become a webstruct default?

Within the HtmlTokenizer constructor we have a few options available but none seem suitable for this task.

  • ignore_html_tags will ignore the element and its children, but will also remove any tail text; e.g. in <html><body>start<option>hello</option>end</body></html> the text "end" would be lost.

  • kill_html_tags will drop the element and its children and preserve tail text, but this requires keep_child = False, and that parameter is not exposed at the class level (see the lxml sketch after this list).

I would be happy to create a PR/tests for this if required.
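
One possible workaround outside of HtmlTokenizer, using plain lxml (a sketch; lxml.etree.strip_elements with with_tail=False removes the elements and their children but keeps the tail text):

import lxml.html
import lxml.etree

html = '<html><body>start<option>hello</option>end</body></html>'
tree = lxml.html.fromstring(html)

# drop <button>/<option> subtrees but keep the tail text ("end")
lxml.etree.strip_elements(tree, 'button', 'option', with_tail=False)

print(lxml.html.tostring(tree))  # b'<html><body>startend</body></html>'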

Python 3 support

Do you intend to add it? If so, I can help update your code to be compatible with both Python 2 and 3.

Get rid of scikit-learn dependencies

Without scikit-learn and seqlearn it would be possible to install webstruct without the numpy/scipy stack and without Cython. They are used for auxiliary things: nicer __repr__s, Pipeline instead of hand-written classes (a rough sketch of that subset is below), and metrics (which are broken anyway, see #14) - nothing serious or hard to replace.
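
For instance, the Pipeline part could plausibly be replaced with a few lines (a rough sketch of the subset of sklearn.pipeline.Pipeline behaviour a webstruct model needs, not a drop-in replacement):

class SimplePipeline(object):
    """Fit each transformer in order, then the final estimator."""

    def __init__(self, steps):
        self.steps = steps   # list of (name, estimator) pairs

    def fit(self, X, y=None):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)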

Pretrained Models

Webstruct looks like a really cool extension to have for any scraping enthusiast, so thank you for creating this!
It would be really awesome if you guys could also release some pre-trained models along with this library. It's not feasible for every user to have loads of annotated data, and what people are generally looking for are the most common entities (NAME, PLACE, ORGANISATION, etc.). A humble suggestion 😄

to_webannotator may fail if an attribute value of some HTML element contains a control character

Traceback (after trying NER.annotate() on the https://github.com/scrapinghub/webstruct/blob/master/webstruct_data/corpus/business_pages/source/301.html page):

ValueError                                Traceback (most recent call last)
<ipython-input-8-45ad24ffcda1> in <module>()
      9     try:
     10         with open(fn, 'rb') as f:
---> 11             annotated = ner.annotate(f.read())
     12 
     13         path, filename = os.path.split(fn)

/Users/kmike/svn/webstruct/webstruct/model.pyc in annotate(self, bytes_data, pretty_print)
    105         html_tokens, tags = self.extract_raw(bytes_data)
    106         tree = self.html_tokenizer.detokenize_single(html_tokens, tags)
--> 107         tree = to_webannotator(tree, self.entity_colors)
    108         return tostring(tree, pretty_print=pretty_print)
    109 

/Users/kmike/svn/webstruct/webstruct/webannotator.py in to_webannotator(tree, entity_colors)
    258     """
    259     handler = _WaContentHandler(entity_colors)
--> 260     lxml.sax.saxify(tree, handler)
    261     tree = handler.out.etree
    262     _copy_title(tree)

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in saxify(element_or_tree, content_handler)
    245     them against a SAX ContentHandler.
    246     """
--> 247     return ElementTreeProducer(element_or_tree, content_handler).saxify()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in saxify(self)
    178                 self._recursive_saxify(sibling, {})
    179 
--> 180         self._recursive_saxify(element, {})
    181 
    182         if hasattr(element, 'getnext'):

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in _recursive_saxify(self, element, prefixes)
    224             content_handler.characters(element.text)
    225         for child in element:
--> 226             self._recursive_saxify(child, prefixes)
    227         content_handler.endElementNS((ns_uri, local_name), qname)
    228         for prefix, uri in new_prefixes:

[... the same _recursive_saxify frame repeated once per level of HTML nesting ...]

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in _recursive_saxify(self, element, prefixes)
    220             content_handler.startPrefixMapping(prefix, uri)
    221         content_handler.startElementNS((ns_uri, local_name),
--> 222                                        qname, sax_attributes)
    223         if element.text:
    224             content_handler.characters(element.text)

/Users/kmike/svn/webstruct/webstruct/webannotator.py in startElementNS(self, name, qname, attributes)
    122         self._closeSpan()
    123         # print('start %s' % qname)
--> 124         self.out.startElementNS(name, qname, attributes)
    125         self._openSpan()
    126 

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/sax.pyc in startElementNS(self, ns_name, qname, attributes)
    110         else:
    111             element = SubElement(element_stack[-1], el_name,
--> 112                                  attrs, self._new_mappings)
    113         element_stack.append(element)
    114 

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree.SubElement (src/lxml/lxml.etree.c:67070)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:15492)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:15423)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._initNodeAttributes (src/lxml/lxml.etree.c:16529)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._addAttributeToNode (src/lxml/lxml.etree.c:16701)()

/Users/kmike/envs/scraping/lib/python2.7/site-packages/lxml/etree.so in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)()

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
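
A possible pre-processing workaround (a hypothetical helper, not part of webstruct): strip XML-incompatible control characters from attribute values before the tree reaches lxml.sax.saxify.

import re

# characters lxml rejects: everything below 0x20 except \t, \n and \r
_CONTROL_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def strip_control_chars(tree):
    """Sanitize attribute values in place so saxify() stops raising
    "All strings must be XML compatible" (hypothetical helper)."""
    for el in tree.iter():
        for name, value in el.attrib.items():
            el.attrib[name] = _CONTROL_CHARS.sub(u'', value)
    return tree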

don't use conda on travis

Installation of numpy/scipy/scikit-learn should be fast with wheels; there seems to be no more reason to use conda on Travis.
