
Comments (9)

GoogleCodeExporter commented on June 2, 2024
The ArticleExtractor appears to assume that li elements are part of a menu of 
some sort. Generally this is correct, but can we assume that menus are not 
normally ordered lists?

Working from that assumption, I was able to modify the code to accept li 
elements that are inside an ol, merging them into one text block and adding 
the order number before each li. I have not had the chance to test my 
modifications against a wide variety of articles, but they seem to work as 
expected.
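
The idea above can be sketched roughly as follows. This is not the actual boilerpipe patch (boilerpipe is written in Java, and the modified code is not shown in this thread); it is a minimal Python illustration using only the standard library. The names `OrderedListMerger` and `merge_ordered_lists` are hypothetical, and nested lists are deliberately not handled.

```python
from html.parser import HTMLParser


class OrderedListMerger(HTMLParser):
    """Collect every <ol> into one numbered text block (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # one merged string per <ol>
        self._items = None  # items of the <ol> being read, or None outside an <ol>
        self._buf = None    # text pieces of the current <li>, or None

    def handle_starttag(self, tag, attrs):
        if tag == "ol" and self._items is None:
            self._items = []
        elif tag == "li" and self._items is not None:
            self._close_item()  # tolerate a missing </li>
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "li":
            self._close_item()
        elif tag == "ol" and self._items is not None:
            self._close_item()
            # Prefix each item with its position in the list.
            numbered = [f"{i}. {text}" for i, text in enumerate(self._items, 1)]
            self.blocks.append("\n".join(numbered))
            self._items = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def _close_item(self):
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())  # normalize whitespace
            if text:
                self._items.append(text)
            self._buf = None


def merge_ordered_lists(html):
    """Return one text block per <ol>; <li> elements in a <ul> are ignored."""
    parser = OrderedListMerger()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Li elements that sit in a ul are still skipped here, which mirrors the assumption that unordered lists are more likely to be menus.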

Original comment by [email protected] on 15 Mar 2012 at 10:03

from boilerpipe.

GoogleCodeExporter commented on June 2, 2024
I think it is relatively safe to assume that, but I doubt it is always the 
case: some obscure reason may push a developer to use <ol>s instead of <ul>s.

Also, it is very common for articles to have <ul>s inside them. Is there 
something that can be added that looks for leading/trailing content blocks 
longer than X characters? Alternatively, you could look for lists inside an 
element that also contains a large volume of text.

It is unlikely that a div would contain lists as well as a large volume of 
text (where the text is outside of the list element).
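
A rough sketch of that second suggestion, assuming the page has already been split into text blocks that each know their enclosing element. The `Block` type, the `container` id, and the 200-character threshold are all assumptions made for illustration; none of them come from boilerpipe.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    is_list_item: bool
    container: int  # hypothetical id of the enclosing element (e.g. a div)


def keep_list_items(blocks, min_outside_chars=200):
    """Keep list items only when their container also holds enough non-list text."""
    # Count the characters of non-list text per container.
    outside = {}
    for b in blocks:
        if not b.is_list_item:
            outside[b.container] = outside.get(b.container, 0) + len(b.text)
    # Non-list blocks pass through; list items need a text-heavy container.
    return [
        b for b in blocks
        if not b.is_list_item or outside.get(b.container, 0) >= min_outside_chars
    ]
```

A container with nothing but list items (a typical menu) has zero characters of outside text, so its items are dropped, while a list embedded in article text is kept.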

Hope that all makes sense?

Original comment by [email protected] on 16 Mar 2012 at 7:12

GoogleCodeExporter commented on June 2, 2024
Partially fixed in r170. The issue with LIs still needs to be tackled, although 
there are other reasons for this behavior.

Please try again and tell me if you are happy with the results.

Cheers,
Christian

Original comment by ckkohl79 on 21 Mar 2012 at 10:10

  • Changed state: Fixed

GoogleCodeExporter commented on June 2, 2024
I tried this using 
http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2Fwww.seomoz.org%2Fugc%2Flink-building-management&extractor=ArticleExtractor&output=htmlFragment 
and it doesn't seem to have made much of a difference. Has the appspot version 
been updated yet?

Notice how they have used an ol and a ul within the same article.

Original comment by [email protected] on 22 Mar 2012 at 8:48

GoogleCodeExporter commented on June 2, 2024
Please try again. It's now live on boilerpipe-web.
(before, it was only on SVN trunk)

Original comment by ckkohl79 on 22 Mar 2012 at 5:48

GoogleCodeExporter commented on June 2, 2024
Looking much better. The only remaining issue is that it seems to have trimmed 
out two of the <ol> <li>s: 
http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2Fwww.seomoz.org%2Fugc%2Flink-building-management&extractor=ArticleExtractor&output=htmlFragment

These two:
Majestic SEO - Deeper than OSE but contains noisy, unfiltered data.
Official Google Toolbar (PageRank) - Single metric. Infrequently updated.

Great work by the way, works almost perfectly now :-)

Original comment by [email protected] on 22 Mar 2012 at 5:53

GoogleCodeExporter commented on June 2, 2024
Can something like the ArticleMetadataFilter be used to remove the "82 Thumbs 
Up, 1 Thumbs Down" block?
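
For illustration only, a filter of that general kind could look like the sketch below. It does not use or reproduce boilerpipe's ArticleMetadataFilter; the pattern and the `drop_vote_blocks` helper are hypothetical and simply match text blocks that consist solely of a vote tally.

```python
import re

# Matches blocks such as "82 Thumbs Up, 1 Thumbs Down" (assumed wording).
VOTE_PATTERN = re.compile(
    r"^\s*\d+\s+thumbs?\s+up\s*,\s*\d+\s+thumbs?\s+down\s*$",
    re.IGNORECASE,
)


def drop_vote_blocks(blocks):
    """Return the text blocks that are not pure vote-tally lines."""
    return [b for b in blocks if not VOTE_PATTERN.match(b)]
```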

Original comment by [email protected] on 22 Mar 2012 at 8:34

GoogleCodeExporter commented on June 2, 2024
Sorry tucker, was that directed at me?

Original comment by [email protected] on 26 Mar 2012 at 2:59
