jmriebold / boilerpy3
This project forked from mercuree/boilerpy
Python port of Boilerpipe library
License: Other
Running the code below raises an error while parsing HTML. The traceback points to the parse_doc function of the boilerpy3 package: the parser tries to pop an element from an empty list (self.label_stacks), which indicates that it hit an unexpected condition in the input. Parsing fails and extraction ends with an HTMLExtractionError.
Sample input HTML is attached: fail1.html.zip
To replicate the issue, the following code was used:
from boilerpy3 import extractors

# x is the HTML string to parse (for example, the contents of the attached fail1.html)
extractor = extractors.ArticleSentencesExtractor()
doc = extractor.get_doc(x)
page_contents = doc.content
Error parsing HTML
Traceback (most recent call last):
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 108, in parse_doc
bp_parser.feed(input_str)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 657, in feed
HTMLParser.feed(self, data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 110, in feed
self.goahead(0)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 172, in goahead
k = self.parse_endtag(i)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 420, in parse_endtag
self.handle_endtag(elem)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 667, in handle_endtag
self.end_element(tag)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 495, in end_element
self.label_stacks.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 114, in parse_doc
bp_parser.feed(input_str)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 657, in feed
HTMLParser.feed(self, data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 110, in feed
self.goahead(0)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 172, in goahead
k = self.parse_endtag(i)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 420, in parse_endtag
self.handle_endtag(elem)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 667, in handle_endtag
self.end_element(tag)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 495, in end_element
self.label_stacks.pop()
IndexError: pop from empty list
Traceback (most recent call last):
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 108, in parse_doc
bp_parser.feed(input_str)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 657, in feed
HTMLParser.feed(self, data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 110, in feed
self.goahead(0)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 172, in goahead
k = self.parse_endtag(i)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 420, in parse_endtag
self.handle_endtag(elem)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 667, in handle_endtag
self.end_element(tag)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 495, in end_element
self.label_stacks.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 114, in parse_doc
bp_parser.feed(input_str)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 657, in feed
HTMLParser.feed(self, data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 110, in feed
self.goahead(0)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 172, in goahead
k = self.parse_endtag(i)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 420, in parse_endtag
self.handle_endtag(elem)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 667, in handle_endtag
self.end_element(tag)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 495, in end_element
self.label_stacks.pop()
IndexError: pop from empty list
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevconsole.py", line 364, in runcode
coro = func()
File "<input>", line 1578, in <module>
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 63, in get_doc
doc = self.parse_doc(text)
File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 118, in parse_doc
raise HTMLExtractionError from ex
boilerpy3.exceptions.HTMLExtractionError
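The failure mode in the traceback (an end tag arriving while the stack is already empty) can be reproduced in isolation. The following is a minimal sketch using only the standard library's html.parser. TagStackParser and its attributes are hypothetical names for illustration and are not boilerpy3 code; the sketch only shows how guarding the pop lets parsing continue past a stray closing tag.

```python
from html.parser import HTMLParser


class TagStackParser(HTMLParser):
    """Hypothetical parser that tracks open tags on a stack (not boilerpy3 code)."""

    def __init__(self):
        super().__init__()
        self.stack = []           # tags that are currently open
        self.stray_end_tags = []  # end tags seen while nothing was open

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # An unguarded self.stack.pop() would raise IndexError here when the
        # input contains a closing tag with no matching opening tag.
        if self.stack:
            self.stack.pop()
        else:
            self.stray_end_tags.append(tag)


parser = TagStackParser()
parser.feed("<div><p>text</p></div></div>")  # one </div> too many
parser.close()
print(parser.stray_end_tags)
```

Whether the attached file fails for this exact reason is an assumption; the sketch only demonstrates the class of input that empties a tag stack early.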
When I give the program a link or an HTML file, I would like to get the extracted output as HTML, along with the links inside it.
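For the links part of this request, one option is to collect them separately from the text extraction. The sketch below uses only the standard library; LinkCollector and extract_links are hypothetical helpers, not part of boilerpy3, and they simply gather the href value of every a tag in an HTML string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags (not part of boilerpy3)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; value may be None
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the href of every <a> tag in html, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = '<p>See <a href="https://example.com/a">A</a> and <a href="/b">B</a>.</p>'
print(extract_links(sample))
```

Relative links such as /b are returned as written; resolving them against the page URL would be a further step.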
Hi,
Sometimes I pass HTML to get_doc, but it returns this warning with empty content:
WARNING:boilerpy3:Warning: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to BoilerPy3 again. Trying to recover somehow..
Any automated tips to fix this?
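The warning asks for the HTML to be cleaned externally, so one possible approach is to rewrite the markup before passing it to get_doc. The sketch below is an assumption-laden illustration using only the standard library: NestedAnchorFlattener and flatten_nested_anchors are hypothetical names, not part of boilerpy3, and the cleaner simply closes an open a element before the next one starts. It is not guaranteed to remove the warning for every page.

```python
from html import escape
from html.parser import HTMLParser


class NestedAnchorFlattener(HTMLParser):
    """Hypothetical pre-cleaner (not part of boilerpy3): re-emits HTML so that
    no <a> element is opened while another <a> is still open."""

    def __init__(self):
        # convert_charrefs=False leaves entities such as &amp; in text untouched
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_anchor = False

    def _emit_start(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            # attribute values arrive unescaped, so escape them again
            parts.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            if self.in_anchor:
                self.out.append("</a>")  # close the anchor that is still open
            self.in_anchor = True
        self._emit_start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        if tag == "a":
            if not self.in_anchor:
                return  # drop a </a> whose anchor was already closed
            self.in_anchor = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def flatten_nested_anchors(html):
    """Return html with nested <a> elements turned into sibling elements."""
    cleaner = NestedAnchorFlattener()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


print(flatten_nested_anchors('<a href="/x">one <a href="/y">two</a> three</a>'))
```

The cleaned string would then be passed to the extractor in place of the original HTML.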
I am trying out boilerpy3 with a simple WARC file, but I get the following error:
Error parsing HTML
Traceback (most recent call last):
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/extractors.py", line 81, in parse_doc
bp_parser.feed(input_str)
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 652, in feed
self.end_document()
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 459, in end_document
self.flush_block()
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 534, in flush_block
if self.last_start_tag.lower() == "title":
AttributeError: 'NoneType' object has no attribute 'lower'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/extractors.py", line 87, in parse_doc
bp_parser.feed(input_str)
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 652, in feed
self.end_document()
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 459, in end_document
self.flush_block()
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 534, in flush_block
if self.last_start_tag.lower() == "title":
AttributeError: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
File "/home/mani/Workspace/Researches/Thesis/nparacrawl/npc-miner/main.py", line 23, in
app.create_db()
File "/home/mani/Workspace/Researches/Thesis/nparacrawl/npc-miner/main.py", line 19, in create_db
print(text_extractor.get_content(text))
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/extractors.py", line 33, in get_content
return self.get_doc(text).content
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/extractors.py", line 49, in get_doc
self.filter.process(doc)
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/filters.py", line 98, in process
is_updated |= filtr.process(doc)
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/filters.py", line 859, in process
for tb in doc.text_blocks:
AttributeError: 'NoneType' object has no attribute 'text_blocks'
Example Domain
This domain is established to be used for illustrative examples in documents. You may use this
domain in examples without prior coordination or asking for permission.
I was using this WARC file for testing. Text is extracted successfully, but an error message is still printed. For complex web pages, I get no output except the error message. I have tried Python 3.9 and 3.8. Is anybody aware of this issue?
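The AttributeError on self.last_start_tag suggests that the parser reached the end of a document without having seen any start tag, which could happen if some WARC record payloads are not HTML. That cause is an assumption and is not confirmed here. Under that assumption, a minimal sketch of a pre-check that skips such payloads before extraction is shown below; has_start_tag and the sample records are hypothetical and use only the standard library.

```python
from html.parser import HTMLParser


class _StartTagProbe(HTMLParser):
    """Hypothetical probe: records whether any start tag was seen."""

    def __init__(self):
        super().__init__()
        self.seen = False

    def handle_starttag(self, tag, attrs):
        self.seen = True


def has_start_tag(text):
    """Return True if text contains at least one HTML start tag."""
    probe = _StartTagProbe()
    probe.feed(text)
    probe.close()
    return probe.seen


# Made-up payloads standing in for decoded WARC record bodies.
records = [
    "<html><head><title>Example Domain</title></head><body><p>Hi</p></body></html>",
    "software: some-crawler\nformat: WARC File Format 1.0\n",  # metadata, not HTML
]
html_records = [r for r in records if has_start_tag(r)]
print(len(html_records))
```

Only the payloads that pass the check would then be handed to the extractor's get_content.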