boilerpipe's Introduction

boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

NOTE: This is a work-in-progress migration from Google Code.

The latest stable version of boilerpipe is available at https://code.google.com/p/boilerpipe.

boilerpipe's People

Contributors

jonathansantilli, kohlschuetter, t-mochizuki

boilerpipe's Issues

Please remove nekohtml classes from your project

Could you please remove the nekohtml classes from your project? They should be referenced as a dependency, not bundled directly into your jar. Bundling them causes problems when someone tries to build a monolithic jar of their own project.

Thanks.

Feature request: support for schema.org/Article in ArticleExtractor

(updated)

HTML5 introduces microdata by adding the attributes itemscope, itemid, itemtype, itemprop and itemref. These attributes provide valuable information about the semantic role of the parts of a document. This information can also be very useful for parsing the contents of a website as the author intended, rather than estimating their intent with statistical or other heuristics.

An effort to standardize the values of these attributes is available at http://schema.org/, which defines various types of documents, such as Article: http://schema.org/Article

One website I encountered that uses this effectively is http://tweakers.net/. The ArticleExtractor does a poor job on this website: in addition to the article text, its output includes several (but not all) user comments.

In my setup, I currently handle this by first checking for any HTML elements with an itemprop=articleBody or itemprop=description attribute and using that text when available rather than invoking BoilerPipe. It would be great if this knowledge could somehow be incorporated into a library such as BoilerPipe, which focuses on extracting the article from such an HTML document.
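
The microdata-first check described above can be sketched roughly as follows. This is only an illustration, not boilerpipe code: the class and method names are hypothetical, and it assumes the page is well-formed XHTML so that the JDK's XML parser can read it (real-world HTML would need a lenient HTML parser first). It also matches the itemprop value exactly, so an attribute that lists several space-separated properties is not found.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class MicrodataFirstSketch {

    /**
     * Returns the text of the first element marked itemprop="articleBody",
     * or failing that itemprop="description". Returns null when neither is
     * present, in which case the caller would fall back to an extractor
     * such as ArticleExtractor.
     */
    static String extractMicrodataText(String xhtml) throws Exception {
        // Assumption: the input is well-formed XHTML without a DOCTYPE.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Try the more specific property first, then the fallback.
        for (String prop : new String[] {"articleBody", "description"}) {
            Node node = (Node) xpath.evaluate(
                    "//*[@itemprop='" + prop + "']", doc, XPathConstants.NODE);
            if (node != null) {
                // getTextContent() concatenates the text of all descendants.
                return node.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div>menu</div>"
                + "<div itemprop=\"articleBody\">Hello <b>world</b></div>"
                + "</body></html>";
        System.out.println(extractMicrodataText(page));
    }
}
```

Returning null rather than an empty string keeps "no microdata found" distinct from "microdata found but empty", which makes the fallback decision explicit.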

EOL Versions

I am trying to figure out if there are any versions of boilerpipe that are EOL, and if so, when do these versions become EOL?

Time out in HTMLFetcher

HTMLDocument fetch(final URL url) sets no timeout. Ideally, a timeout should be set right after the connection is created with final URLConnection conn = url.openConnection();. Please assign this issue to me and I will send a pull request.
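
A minimal sketch of the proposed change, using the standard java.net.URLConnection setters. The class name, the helper method, and the timeout values are hypothetical and chosen for illustration; they are not taken from boilerpipe.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

public class TimeoutFetchSketch {
    // Hypothetical defaults, in milliseconds.
    static final int CONNECT_TIMEOUT_MS = 10_000;
    static final int READ_TIMEOUT_MS = 30_000;

    /**
     * Creates the connection object and sets both timeouts before any
     * network I/O happens. openConnection() itself does not connect.
     */
    static URLConnection openWithTimeouts(URL url, int connectMs, int readMs)
            throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(connectMs); // limit on establishing the connection
        conn.setReadTimeout(readMs);       // limit on each blocking read
        return conn;
    }

    public static void main(String[] args) throws IOException {
        URL url = URI.create("http://example.com/").toURL();
        URLConnection conn =
                openWithTimeouts(url, CONNECT_TIMEOUT_MS, READ_TIMEOUT_MS);
        System.out.println(conn.getConnectTimeout() + " " + conn.getReadTimeout());
    }
}
```

Both setters are needed: the connect timeout alone does not help when a server accepts the connection and then stops sending data. A timeout of 0, the default, means wait indefinitely.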

Go Port of Dom Distiller

For those interested in an updated version of Boilerpipe: the Chromium team based DOM Distiller, the library behind reader mode in Chrome, on boilerpipe. It is written in Java (like Boilerpipe) and has a similar file structure to the boilerpipe repo. However, the library has GWT dependencies and is meant to compile to JavaScript.

We ported the Java code of DOM Distiller to Go (without GWT and any Chromium dependencies):

https://github.com/markusmobius/go-domdistiller

It now works fine as a server-side program or command-line program, similar to the original Boilerpipe.

The stable branch is the most faithful port of Distiller (everything minus some parts where Distiller relies on some render-level info).

The master branch includes insights from Mozilla's readability.

Is it possible to add the `Access-Control-Allow-Origin: *` header to https://boilerpipe-web.appspot.com/?

Hello @kohlschuetter,

Thanks for this awesome project and the service at https://boilerpipe-web.appspot.com/. Because it is so powerful, browser extensions like https://github.com/Muffo/fullyfeedly could use it to extract articles from websites. Without an Access-Control-Allow-Origin header to allow cross-origin resource sharing, however, this can't really work.

As https://boilerpipe-web.appspot.com/ is a public service, I wonder if it's possible to add the header with the value *, so that it can be used from many places?

Thank you again!

maintain offsets

Hi Kristian, Is it possible for boilerpipe to generate offsets as its output, so that they can be used to reference the original HTML? I looked in ArticleExtractor for something about character or xpath offsets.

Ideally, we'd have something like a list of (xpath + char offset) for each transition, like (junk, title, junk, byline, junk, body, junk, body, junk, body, junk, footnotes).

Thanks for any thoughts on this.

John

hello

Hello kohl, greetings from the 3rd world. Where can I find the link for 2.0? Is it out yet? Is there an email address I can contact for support? I can't see one on the website.

Extraction Issue

Hello @kohlschuetter ,

First off, I have to say, Boilerpipe is AMAZING! Thank you for your work on this.

In a few cases, I am having a bit of an extraction issue. With the GitHub code, there are some articles where the extraction starts late. For example, on https://en.wikipedia.org/wiki/New_York_City the output starts at "Further information: Police surveillance in New York City and Crime in New York City". However, when I check that same article on https://boilerpipe-web.appspot.com/, the web API always gets the full text. I've been banging my head against the wall trying to figure out what I was doing wrong, and figured I should message the inventor. The only two things I can think of are: 1) I am totally missing something, or 2) the web API might be running a slightly different version. Do you know what might be going on here?

Hope you are having a great weekend!

Best,
Kevin

Title content should not be shown in the result

Right now the contents of <title> are shown as part of the document.

I'm not sure this is appropriate. <title> is usually invisible content on the page and not really part of the visible text portion of the document.

Perhaps it could still be included, but as a separate field.

However, TextDocument already has a getTitle() method, so that can be used, and the 'title' element can then be ignored as part of the content.

Deploy to maven repository

Please deploy to a central maven repository, e.g. http://central.sonatype.org/pages/ossrh-guide.html, so we can use the 2.0-SNAPSHOT and 1.2 release from maven. It's a bit of a nightmare trying to include static pre-compiled jars in a maven project when other projects depend on it.

p.s. Thanks for the fantastic library. It's working well for my project, even crawling Dutch sites.

mvn compile is broken due to relocation

What is the reason for relocating nekohtml? Is it the published nekohtml with some patches applied? Currently mvn compile fails while mvn package succeeds.
The implication is that when I import the project into IntelliJ, it can't be compiled: if I understand correctly, relocation happens after compilation and is not invoked by the IDE.

StackOverflowError

See this page: http://ccbdrfc.tripod.com/mpgallery.html

Related discussion: http://sourceforge.net/p/nekohtml/bugs/123/

java.lang.StackOverflowError
    at java.util.ArrayList.<init>(ArrayList.java:177)
    at org.cyberneko.html.HTMLTagBalancer.consumeBufferedEndElements(HTMLTagBalancer.java:506)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:589)
    at org.cyberneko.html.HTMLTagBalancer.forceStartElement(HTMLTagBalancer.java:760)
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:637)
    at org.cyberneko.html.HTMLTagBalancer.forceStartElement(HTMLTagBalancer.java:760)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1002)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)
