kohlschutter / boilerpipe

Work-in-progress transfer from Google Code

Home Page: https://code.google.com/p/boilerpipe/

License: Other

Java 99.54% Shell 0.46%

boilerpipe's Introduction

boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

NOTE: This is a work-in-progress transfer from Google Code.

The latest stable version of boilerpipe is available at https://code.google.com/p/boilerpipe.

boilerpipe's Issues

How do I Specify output format in Source Code for extracting Article?

I'm trying as follows,
System.out.println(ArticleExtractor.INSTANCE.getText(url));

This gives me plain-text output, but I want the output in JSON format. Please advise how to change the output format in the source code.
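
One way to get JSON is to wrap the extracted string yourself. The following is a minimal sketch, assuming the text comes from the ArticleExtractor.INSTANCE.getText(url) call shown above. The ArticleJson class and the field names "url" and "text" are hypothetical, and the escaping is hand-rolled only to keep the example free of dependencies; a JSON library would normally handle it.

```java
// Hypothetical helper, not part of boilerpipe: wraps already-extracted text in a JSON object.
final class ArticleJson {
    private ArticleJson() {}

    /** Escapes a string so it can be placed inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be written as \\uXXXX escapes.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    /** Builds a two-field JSON object; the field names are assumptions for this sketch. */
    static String toJson(String url, String text) {
        return "{\"url\":\"" + escape(url) + "\",\"text\":\"" + escape(text) + "\"}";
    }
}
```

With boilerpipe on the classpath, the call from the question would then become something like System.out.println(ArticleJson.toJson(url.toString(), ArticleExtractor.INSTANCE.getText(url)));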

Please remove nekohtml classes from your project

Could you please remove the nekohtml classes from your project? They should be referenced as a dependency, not bundled directly into your jar. Bundling them causes problems when someone tries to build a monolithic jar of their own project.

Thanks.

Feature request: support for schema.org/Article in ArticleExtractor

(updated)

HTML5 introduces microdata by adding the attributes itemscope, itemid, itemtype, itemprop and itemref. These attributes provide valuable information about the semantic role of the parts of a document. This information can also be very useful for parsing the contents of a website as the author intended, rather than estimating their intent using statistical or other heuristics.

An effort to standardize the value of these attributes is available on http://schema.org/ which defines various types of documents, such as Article: http://schema.org/Article

One example of a website that uses this effectively is http://tweakers.net/. The ArticleExtractor itself does a poor job on this website: besides the article text, it also includes several (but not all) user comments.

In my setup, I have currently implemented this by first checking for any HTML elements with an itemprop=articleBody or itemprop=description attribute and using that text when available, rather than invoking BoilerPipe. It would be great if this knowledge could somehow be incorporated into a library such as BoilerPipe, which focuses on extracting the article from such an HTML document.
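
The check-microdata-first approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not BoilerPipe code: the MicrodataFirst class is hypothetical, the input is assumed to be well-formed XHTML so that the JDK's XML parser can read it (real-world pages would need a lenient HTML parser), and the fallback function stands in for whatever extractor you would otherwise call.

```java
import java.io.StringReader;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: prefer schema.org microdata, fall back to another extractor.
final class MicrodataFirst {
    private MicrodataFirst() {}

    /**
     * Returns the text of the first element marked itemprop=articleBody
     * (or, failing that, itemprop=description); otherwise applies the fallback.
     */
    static String extract(String xhtml, Function<String, String> fallback) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            for (String prop : new String[] {"articleBody", "description"}) {
                NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                        .evaluate("//*[@itemprop='" + prop + "']", doc, XPathConstants.NODESET);
                if (hits.getLength() > 0) {
                    String text = hits.item(0).getTextContent().trim();
                    if (!text.isEmpty()) {
                        return text;
                    }
                }
            }
        } catch (Exception e) {
            // Input was not well-formed XML; fall through to the fallback extractor.
        }
        return fallback.apply(xhtml);
    }
}
```

The fallback is passed in as a function so that the microdata check stays independent of the extractor used when no markup is present.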

EOL Versions

I am trying to figure out if there are any versions of boilerpipe that are EOL, and if so, when do these versions become EOL?

Time out in HTMLFetcher

In HTMLDocument fetch(final URL url) there is no timeout. Ideally, a timeout should be set right after the connection is created with final URLConnection conn = url.openConnection();. Please assign this issue to me and I will send a pull request.
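
A minimal sketch of the proposed change, using the standard java.net.URLConnection setters. The TimeoutFetch class, its method name, and the idea of passing the limits as parameters are assumptions for illustration; this is not the project's actual patch.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper showing where the timeouts would be applied.
final class TimeoutFetch {
    private TimeoutFetch() {}

    /** Opens a connection object with connect and read timeouts, in milliseconds. */
    static URLConnection open(URL url, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        // openConnection() only creates the connection object; no network I/O happens yet.
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // limit for establishing the connection
        conn.setReadTimeout(readTimeoutMs);       // limit for each blocking read
        return conn;
    }
}
```

Once the timeouts are set, a stalled server causes a java.net.SocketTimeoutException on connect or read instead of blocking indefinitely.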

Go Port of Dom Distiller

For those interested in an updated version of Boilerpipe: the Chromium team based DOM Distiller, their library for Chrome's reader mode, on Boilerpipe. It is written in Java (like Boilerpipe) and has a similar file structure to the boilerpipe repo. However, the library has GWT dependencies and is meant to compile to JavaScript.

We ported the Java code of DOM Distiller to Go (without GWT and any Chromium dependencies):

https://github.com/markusmobius/go-domdistiller

It now works fine as a server-side program or a command-line program, similar to the original Boilerpipe.

The stable branch is the most faithful port of Distiller (everything minus some parts where Distiller relies on some render-level info).

The master branch includes insights from Mozilla's readability.

Is it possible to add `Access-Control-Allow-Origin: *` header to https://boilerpipe-web.appspot.com/

Hello @kohlschuetter,

Thanks for this awesome project and the service at https://boilerpipe-web.appspot.com/. Because of its powerful features, browser extensions like https://github.com/Muffo/fullyfeedly could use it to extract articles from websites. However, without an Access-Control-Allow-Origin header to allow cross-origin resource sharing, this can't really work.

As https://boilerpipe-web.appspot.com/ is a public service, I wonder if it's possible to add the header with the value *, so that it can be used from many places?
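
For illustration, here is how such a header is typically set on a response in a plain Java server. This sketch uses the JDK's built-in com.sun.net.httpserver package purely as an assumption; how the boilerpipe-web service is actually implemented is not known from this thread, and the CorsHandler class and the placeholder response body are hypothetical.

```java
import com.sun.net.httpserver.Headers;
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical handler: adds the CORS header before sending a response.
final class CorsHandler implements HttpHandler {

    /** Allows any origin to read the response from browser scripts. */
    static void allowAnyOrigin(Headers headers) {
        headers.set("Access-Control-Allow-Origin", "*");
    }

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        // Headers must be set before sendResponseHeaders() is called.
        allowAnyOrigin(exchange.getResponseHeaders());
        byte[] body = "extracted text would go here".getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }
}
```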

Thank you again!

maintain offsets

Hi Kristian, Is it possible for boilerpipe to generate offsets as its output, so that they can be used to reference the original HTML? I looked in ArticleExtractor for something about character or xpath offsets.

Ideally, we'd have something like a list of (xpath + char offset) for each transition, like (junk, title, junk, byline, junk, body, junk, body, junk, body, junk, footnotes).

Thanks for any thoughts on this.

John

hello

Hello Kohl, greetings from the third world. Where can I find the link for 2.0? Is it out yet? Is there an email address I can contact for support? I can't see one on the website.

Android gradle support

Many thanks for the great effort. I hope to see Android Gradle support.

Extraction Issue

Hello @kohlschuetter ,

First off, I have to say, Boilerpipe is AMAZING! Thank you for your work on this.

In a few cases, I am having a bit of an extraction issue. With the GitHub code, there are some articles where the extraction starts late. For example, on https://en.wikipedia.org/wiki/New_York_City the output starts at "Further information: Police surveillance in New York City and Crime in New York City". However, when I check that same article on https://boilerpipe-web.appspot.com/, the web API always gets the full text. I've been banging my head against the wall trying to figure out what I was doing wrong, and just figured I should message the inventor. The only two things I could think of are: 1) I am totally missing something, or 2) the web API might be a slightly different version. Do you know what might be going on here?

Hope you are having a great weekend!

Best,
Kevin

Boilerpipe 2.0 is not available in maven repository?

Title content should not be shown in the result

Right now elements in <title> are shown as part of the document.

I'm not sure this is appropriate. <title> is usually invisible content on the page and not really part of the visible text portion of the document.

Perhaps the title could still be included, but as a separate field.

In fact, TextDocument already has a getTitle() method, so callers can use that, and the <title> element can be left out of the content.

Deploy to maven repository

Please deploy to a central maven repository, e.g. http://central.sonatype.org/pages/ossrh-guide.html, so we can use the 2.0-SNAPSHOT and 1.2 release from maven. It's a bit of a nightmare trying to include static pre-compiled jars in a maven project when other projects depend on it.

p.s. Thanks for the fantastic library. It's working well for my project, even crawling Dutch sites.

mvn compile is broken due to relocation

What is the reason for relocating nekohtml? Is it the published nekohtml with some patches applied? Currently mvn compile fails while mvn package succeeds?!
The implication is that when I import the project into IntelliJ, it can't be compiled: if I understand correctly, relocation happens after compilation and is not invoked by the IDE.

This is a great innovation. Why don't you keep maintaining it? :(

How to extract both headlines and articles using ARTICLE_EXTRACTOR

Hello,

How can I extract both headlines and articles using the article extractors? For example, for a news article I want to pull out the headline and the corresponding article text as a pair and store them separately.
How can we do that?

Thanks in advance,
regards,
Debanjan

StackOverflowError

See this page: http://ccbdrfc.tripod.com/mpgallery.html

Related discussion: http://sourceforge.net/p/nekohtml/bugs/123/

java.lang.StackOverflowError
    at java.util.ArrayList.<init>(ArrayList.java:177)
    at org.cyberneko.html.HTMLTagBalancer.consumeBufferedEndElements(HTMLTagBalancer.java:506)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:589)
    at org.cyberneko.html.HTMLTagBalancer.forceStartElement(HTMLTagBalancer.java:760)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:637)
    at org.cyberneko.html.HTMLTagBalancer.forceStartElement(HTMLTagBalancer.java:760)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1002)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)