slaveofcode / boilerpipe3

A fork of boilerpipe with Python 3 support and small fixes, ported from https://pypi.python.org/pypi/boilerpipe-py3.

boilerpipe3's Issues

Deprecation error when using boilerpipe3 with JPype 0.8

Using the Extractor leads to the following message, triggered by the missing keyword argument convertStrings when boilerpipe launches the JVM.

Deprecated: convertStrings was not specified when starting the JVM. The default
behavior in JPype will be False starting in JPype 0.8. The recommended setting
for new code is convertStrings=False. The legacy value of True was assumed for
this session. If you are a user of an application that reported this warning,
please file a ticket with the developer.

This can be fixed by passing the argument convertStrings=False on line 10 of init() when the JVM is started. As a result, Java strings will be returned, and these will need to be converted to Python strings on output.
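A minimal sketch of that change, assuming JPype is importable as jpype. The helper names start_jvm and to_python_text are hypothetical and not part of boilerpipe3; the JPype module is passed in as a parameter so the logic can be exercised without a running JVM.

```python
def start_jvm(jpype_module, jvm_args=()):
    """Start the JVM once, stating convertStrings explicitly.

    `jpype_module` is expected to be the imported `jpype` package.
    Returns True if this call started the JVM, False if it was already up.
    """
    if jpype_module.isJVMStarted():
        return False
    jpype_module.startJVM(
        jpype_module.getDefaultJVMPath(),
        *jvm_args,
        convertStrings=False,  # silences the deprecation message
    )
    return True


def to_python_text(value):
    """Convert a value returned from Java into a Python str.

    With convertStrings=False, Java methods return java.lang.String
    proxies; calling str() on them yields a regular Python string.
    """
    return None if value is None else str(value)
```

In practice this would be called as start_jvm(jpype) before any extractor is created, with to_python_text(...) applied to whatever text the Java side returns.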

UnicodeDecodeError

Hi,

When I try to extract an article from varzesh3.com (for example https://www.varzesh3.com/news/1554055/) I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'your_url' is not defined
>>> your_url = 'https://www.varzesh3.com/news/1554055/'
>>> extractor = Extractor(extractor='ArticleExtractor', url=your_url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/boilerpipe/extract/__init__.py", line 46, in __init__
    self.data = str(self.data, encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I solved this by replacing line 46 with:
self.data = self.data.decode(encoding, "ignore")
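One observation that may explain the error: the byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so the server may have returned a gzip-compressed body. In that case, decoding with "ignore" would silence the exception but leave compressed bytes in the string. The sketch below rests on that assumption and is not a confirmed diagnosis; decode_body is a hypothetical helper, not part of boilerpipe3.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(data: bytes, encoding: str = "utf-8") -> str:
    """Decode an HTTP response body, decompressing it first if it is gzip."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    # "replace" keeps a visible marker for undecodable bytes instead of
    # dropping them silently, as "ignore" would.
    return data.decode(encoding, errors="replace")
```

Using "replace" rather than "ignore" is a design choice: any remaining bad bytes show up as U+FFFD in the text, which makes encoding problems easier to spot.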

java.lang.OutOfMemoryError: Java heap space

Hi!

I've been using Boilerpipe with Bitextor, and everything has worked fine. The problem is that when I processed one particular PDF file, I ran out of memory and the execution failed. The error message I got is:

Traceback (most recent call last):                                                                                                                                                                           
  File "BoilerpipeSAXInput.java", line 51, in de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument                                                                                                       
  File "BoilerpipeSAXInput.java", line 63, in de.l3s.boilerpipe.sax.BoilerpipeSAXInput.getTextDocument                                                                                                       
  File "org.apache.xerces.parsers.AbstractSAXParser.java", line -1, in org.apache.xerces.parsers.AbstractSAXParser.parse                                                                                     
  File "org.apache.xerces.parsers.XMLParser.java", line -1, in org.apache.xerces.parsers.XMLParser.parse                                                                                                     
  File "HTMLConfiguration.java", line 452, in org.cyberneko.html.HTMLConfiguration.parse                                                                                                                     
  File "HTMLConfiguration.java", line 499, in org.cyberneko.html.HTMLConfiguration.parse                                                                                                                     
  File "HTMLScanner.java", line 907, in org.cyberneko.html.HTMLScanner.scanDocument                                                                                                                          
  File "HTMLScanner.java", line 1967, in org.cyberneko.html.HTMLScanner$ContentScanner.scan                                                                                                                  
  File "HTMLScanner.java", line 2291, in org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters                                                                                                        
  File "DefaultFilter.java", line 152, in org.cyberneko.html.filters.DefaultFilter.characters                                                                                                                
  File "HTMLTagBalancer.java", line 954, in org.cyberneko.html.HTMLTagBalancer.characters                                                                                                                    
  File "org.apache.xerces.parsers.AbstractSAXParser.java", line -1, in org.apache.xerces.parsers.AbstractSAXParser.characters                                                                                
  File "BoilerpipeHTMLContentHandler.java", line 293, in de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters                                                                                       
  File "BitSet.java", line 447, in java.util.BitSet.set                                                                                                                                                      
  File "BitSet.java", line 352, in java.util.BitSet.expandTo                                                                                                                                                 
  File "BitSet.java", line 337, in java.util.BitSet.ensureCapacity                                                                                                                                           
  File "Arrays.java", line 3308, in java.util.Arrays.copyOf                                                                                                                                                  
Exception: Java Exception                                                                                                                                                                                    
                                                                                                                                                                                                             
The above exception was the direct cause of the following exception:                                                                                                                                         
                                                                                                                                                                                                             
Traceback (most recent call last):                                                                                                                                                                           
  File "<stdin>", line 1, in <module>
  File "/home/cgarcia/miniconda3/envs/bitextor/lib/python3.8/site-packages/boilerpipe/extract/__init__.py", line 67, in __init__
    self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()
java.lang.OutOfMemoryError: java.lang.OutOfMemoryError: Java heap space

To take Bitextor out of the picture, I am attaching the HTML file that Bitextor generated from the PDF; this HTML file is what causes the problem. It is 9.4 MB, and I don't know whether that is too large for Boilerpipe to handle. The problem is not related to PDFs as such, since I processed other PDFs and they finished without errors.

In the end, I figured out that the problem really was memory (initially I thought it was a memory leak), which seemed very strange to me for a 9.4 MB file. I fixed it by increasing the maximum heap size of the JVM that jpype starts. Processing the 9.4 MB HTML file required about 52 GB of memory! My system has 126 GB, so the default maximum heap size of the JVM is 30 GB. Since the process needed 52 GB and the maximum heap size was 30 GB, it ran out of memory.

I am opening this issue to alert other people who might have the same problem, and to ask the following question: do these numbers make sense? I mean, 52 GB of memory for a 9.4 MB HTML file?
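To put the reported figures side by side, here is a small back-of-the-envelope calculation. It uses only the numbers from this report, plus one assumption: that the JVM's default maximum heap is typically about a quarter of physical memory, which is roughly in line with the 30 GB mentioned above.

```python
html_size_mb = 9.4        # size of the attached HTML file
heap_needed_gb = 52       # heap the extraction reportedly required
physical_memory_gb = 126  # RAM of the reporting machine

# How many times larger the heap demand is than the input file.
blowup = heap_needed_gb * 1024 / html_size_mb

# Assumption: default maximum heap is about 1/4 of physical memory.
default_heap_gb = physical_memory_gb / 4

print(f"heap needed is about {blowup:,.0f}x the file size")
print(f"a quarter of physical memory: {default_heap_gb:.1f} GB")
print(f"fits in the default heap: {heap_needed_gb <= default_heap_gb}")
```

The heap demand is several thousand times the input size, which suggests something specific to this file rather than ordinary parsing overhead.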

The code which triggers the error:

from boilerpipe.extract import Extractor

text = ""

with open("boilerpipe_error.html") as f:
  for l in f:
    text += l

text = text.strip()

Extractor(extractor='ArticleExtractor', html=text)

The fix (run it before the code above; it should work, but I haven't tested it outside of the file it was taken from, so I might have missed something):

import importlib.util
import os

import jpype

# Take 80 GB of memory for boilerpipe (the value is in MB)
boilerpipe_max_heap_size = 80 * 1024  # TODO change this value

if not jpype.isJVMStarted():
    # A negative value means "keep the JVM's default maximum heap size"
    max_heap_size = f"-Xmx{boilerpipe_max_heap_size}M" if boilerpipe_max_heap_size >= 0 else ''
    jars = []

    # Locate the installed boilerpipe package without importing it,
    # then collect the .jar files shipped in its data directory
    package_dir = os.path.dirname(importlib.util.find_spec("boilerpipe").origin)

    for top, dirs, files in os.walk(os.path.join(package_dir, 'data')):
        for nm in files:
            if nm.endswith(".jar"):
                jars.append(os.path.join(top, nm))

    jpype.addClassPath(os.pathsep.join(jars))

    jargs = [jpype.getDefaultJVMPath()]

    if max_heap_size != '':
        jargs.append(max_heap_size)

    jpype.startJVM(*jargs, convertStrings=False)

# ... run boilerpipe

html.tar.gz

KeepEverythingWithMinKWordsExtractor not working

First, thanks for the port.

When trying to use KeepEverythingWithMinKWordsExtractor, I get the error:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    extractor = Extractor(extractor='KeepEverythingWithMinKWordsExtractor', url=url, kMin=20)
  File "/private/tmp/html_extract/venv/lib/python3.6/site-packages/boilerpipe/extract/__init__.py", line 62, in __init__
    "de.l3s.boilerpipe.extractors."+extractor).INSTANCE
AttributeError: type object 'de.l3s.boilerpipe.extractors.KeepEverythingWithMin' has no attribute 'INSTANCE'

The problem is that the KeepEverythingWithMinKWordsExtractor constructor takes an argument (see the java code).

To fix this, line 60 in extract/__init__.py should be replaced with:

if extractor == "KeepEverythingWithMinKWordsExtractor":
    # handle argument
    kMin = kwargs.get("kMin", 1)  # set default to 1
    self.extractor = jpype.JClass(
        "de.l3s.boilerpipe.extractors."+extractor)(kMin)
else:
    self.extractor = jpype.JClass(
        "de.l3s.boilerpipe.extractors."+extractor).INSTANCE

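The same dispatch can be sketched as a standalone function, which makes the two construction paths easier to test without a JVM. The names build_extractor and load_class are hypothetical and not part of boilerpipe3; in real use load_class would be jpype.JClass.

```python
EXTRACTOR_PACKAGE = "de.l3s.boilerpipe.extractors."


def build_extractor(name, load_class, **kwargs):
    """Return an extractor object for the given boilerpipe extractor name.

    `load_class` maps a fully qualified Java class name to a class object
    (for example jpype.JClass).
    """
    java_class = load_class(EXTRACTOR_PACKAGE + name)
    if name == "KeepEverythingWithMinKWordsExtractor":
        # This extractor has no INSTANCE singleton; its constructor
        # takes the minimum word count, defaulting here to 1.
        return java_class(kwargs.get("kMin", 1))
    # The other extractors are used through their INSTANCE singleton.
    return java_class.INSTANCE
```

With a JVM running, a call could look like build_extractor("KeepEverythingWithMinKWordsExtractor", jpype.JClass, kMin=20).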