gravitylabs / goose
Html Content / Article Extractor in Scala - open sourced from Gravity Labs
Home Page: http://gravity.com
License: Apache License 2.0
I haven't made the jump to Scala, I'm on basically the last Java version, so feel free to tell me to do my own debugging (I have some customizations that would need porting), but I'm seeing an issue with nytimes urls. For example, from today's front page:
Is somehow being redirected behind the scenes to www10.nytimes.com which always lands me on the login screen if I use that canonical url. It's correctly extracting the content and top image, etc, but it's got the wrong domain and canonical URL, which makes me think it's something nyt changes in their paywall bouncing redirs?
Is it just me? Maybe I'll just rewrite the canonical url and forget about it for now.
Great work as always!
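The canonical-URL rewrite mentioned above could look roughly like this. It is a minimal sketch that assumes the only problem is a numbered `wwwNN` host prefix; the class and method names are made up for illustration and are not part of Goose.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Goose.
final class CanonicalUrlRewriter {
    // Assumption: "www" followed by digits (e.g. www10) is a redirect mirror
    // of the plain "www" host, so the digits can simply be dropped.
    private static final Pattern NUMBERED_WWW =
            Pattern.compile("^(https?://)www\\d+\\.", Pattern.CASE_INSENSITIVE);

    // Returns the URL with a leading numbered "www" host prefix replaced by "www.".
    // URLs that do not start with such a prefix are returned unchanged.
    static String normalizeHost(String url) {
        return NUMBERED_WWW.matcher(url).replaceFirst("$1www.");
    }
}
```

Only the start of the URL is touched, so a path that happens to contain `www10` is left alone.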
Hi,
I'm trying to use Goose in a Java environment and am having some trouble. I used the example from the front page (goose.extractContent...), but found that it creates a thread every time I call this method. To remove these threads I have to stop the global Goose.crawlingActor() object, which of course prevents access from any other thread, so this is not acceptable. I found an example in the test directory, StaticHTMLTest.java, which is exactly what I want, but it's all commented out. Trying to instantiate ContentExtractor as shown in this example produces the error "com.gravity.goose.extractors.ContentExtractor is abstract; cannot be instantiated". How can I just run the content extractor without starting threads (or with only one extra thread)? Will more valid examples be included in the source tree soon?
Thanks
Content is successfully extracted using the viewtext.org API,
e.g., http://viewtext.org/article?url=http://violetiva.pixnet.net/blog/post/28652839-%5B%E9%A3%9F%E8%A8%98%5D-%E5%8F%B0%E5%8C%97%E2%80%A7%E6%%209D%8F%E5%AD%90%E8%B1%AC%E6%8E%92-
but I get nothing using the Goose API. Is there something wrong?
Hi, I was just trying goose out on some Chinese language news sites, and it doesn't appear to be able to pull any article text. Examples:
http://news.xhby.net/system/2011/10/03/011788372.shtml
http://news.iqilu.com/shehui/huahuashijie/20111003/565892.html
Will your algorithm work on Chinese with a minor fix, or does the text need to be in a Latin-script language?
Thanks,
Joel
When using goose through proose there is one static instance of ContentExtractor and there are multiple calls to extractContent().
However, when you do this the return from getTopImage() always gives the same answer - the best image from the first url passed to extractContent().
Fix is to ensure that a new ImageExtractor class is created on each call to extractContent(). I have put a patch together below which fixes the problem.
index 3a287c1..f16d9ba 100644
--- a/src/main/java/com/jimplush/goose/ContentExtractor.java
+++ b/src/main/java/com/jimplush/goose/ContentExtractor.java
@@ -64,8 +64,6 @@ public class ContentExtractor {
   // once we have our topNode then we want to format that guy for output to the user
   private OutputFormatter outputFormatter;
-  private ImageExtractor imageExtractor;
-
   /**
    * you can optionally pass in a configuration object here that will allow you to override the settings
@@ -121,7 +119,7 @@ public class ContentExtractor {
     if (config.isEnableImageFetching()) {
       HttpClient httpClient = HtmlFetcher.getHttpClient();
-      imageExtractor = getImageExtractor(httpClient, urlToCrawl);
+      ImageExtractor imageExtractor = getImageExtractor(httpClient, urlToCrawl);
       article.setTopImage(imageExtractor.getBestImage(doc, article.getTopNode()));
     }
@@ -170,12 +168,8 @@ public class ContentExtractor {
   private ImageExtractor getImageExtractor(HttpClient httpClient, String urlToCrawl) {
-    if (imageExtractor == null) {
     BestImageGuesser bestImageGuesser = new BestImageGuesser(this.config, httpClient, urlToCrawl);
The info log level should be for administrator/user information, not debugging information. Changing this would help clean up the logs.
I want to select all elements whose class starts with a specified string. For example:
[img class="test-124" src="..." ]
[img class="test-child-121" src="..." ]
[img class="test-123" src="..." ]
[img class="test-122" src="..." ]
I want to select all of the above elements.
How can I do that?
regards,
Jaya
I created a simple Scala project to test out Goose, but I can't seem to extract any content. My code is below. This is my first time using Scala, so I expect the problem is related to that. Any help here is greatly appreciated.
import com.gravity.goose._
import org.jsoup.nodes._

object Reader {
  def main(args : Array[String]) {
    val url = args(0)
    val html = ""
    println("Retrieving URL : " + url)
    val goose = new Goose(new Configuration)
    val article = goose.extractContent(url, html)
    println(article.cleanedArticleText)
  }
}
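One thing worth checking in the snippet above, offered as an assumption about Goose's behavior rather than a confirmed diagnosis: the code passes an empty string as the HTML argument. If the crawler only fetches the page when no HTML was supplied, and treats anything other than null as supplied, an empty string would leave nothing to extract. The Java sketch below models that decision with made-up names (`HtmlSource`, `resolve`); it is not Goose's code.

```java
import java.util.function.Supplier;

// Hypothetical model of a "use the supplied HTML, otherwise fetch" decision.
// It only illustrates why "" and null can lead to different results.
final class HtmlSource {
    static String resolve(String rawHtml, Supplier<String> fetcher) {
        // Only null means "nothing supplied"; an empty string counts as supplied HTML,
        // so the fetcher is never consulted for it.
        return rawHtml != null ? rawHtml : fetcher.get();
    }
}
```

If that is what happens, calling goose.extractContent(url) with only the URL, so that Goose fetches the page itself, would be the first thing to try.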
I've noticed that Goose doesn't work for an Italian site because that site uses the "article" tag.
An example URL is the following:
http://www.repubblica.it/economia/2012/05/12/news/giovani_anziani_asili_nido_e_soldi_per_il_sud_ecco_il_progetto_del_governo_per_l_equit-34962952/
Note that I work with Italian stopwords.
I solved it (for this site) by adding
nodesToCheck.addAll(doc.getElementsByTag("article"))
here: https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala#L387
and changing:
if (e.tagName != "p" && e.tagName != "article")
here
https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala#L509
Will this solution work for every site, or will it mess something up?
Thank you
Running goose against the url "http://www.apo-rot.de/indexdetails.html?_filterartnr=1997030&partnerid=preisfuerst"
we receive the following NPE:
java.lang.NullPointerException
at com.gravity.goose.cleaners.DocumentCleaner$class.cleanBadTags(DocumentCleaner.scala:132)
at com.gravity.goose.cleaners.DocumentCleaner$class.clean(DocumentCleaner.scala:51)
at com.gravity.goose.cleaners.StandardDocumentCleaner.clean(StandardDocumentCleaner.scala:26)
at com.gravity.goose.Crawler$$anonfun$crawl$1$$anonfun$apply$1$$anonfun$apply$2.apply(Crawler.scala:71)
at com.gravity.goose.Crawler$$anonfun$crawl$1$$anonfun$apply$1$$anonfun$apply$2.apply(Crawler.scala:48)
Goose needs a way to filter out janky article titles where the title may have multiple delimiters such as
Breaking News: KCAL05: This just in - some guy won a million bucks
It gets confusing where to separate the title from the prefix. It would be nice to have a text file for special cases, listing per domain the text that should be replaced with blanks.
example:
domain      replace
kcal9.com   Breaking News: KCAL05:
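A per-domain rule file like the one proposed above could be applied along these lines. This is only a sketch: the class name, the "domain, whitespace, text to strip" line format, and the lookup by exact domain are assumptions made for illustration, not an existing Goose feature.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical title cleaner driven by "<domain> <text to strip>" rule lines.
final class TitleCleaner {
    private final Map<String, List<String>> rulesByDomain = new HashMap<>();

    // Parses one rule line; the domain ends at the first space and
    // everything after it is the text to blank out.
    void addRuleLine(String line) {
        String trimmed = line.trim();
        int split = trimmed.indexOf(' ');
        if (split < 0) {
            return; // blank or malformed line: ignore it
        }
        String domain = trimmed.substring(0, split).toLowerCase(Locale.ROOT);
        String text = trimmed.substring(split + 1).trim();
        rulesByDomain.computeIfAbsent(domain, d -> new ArrayList<>()).add(text);
    }

    // Removes every configured text for the domain and trims the result.
    String clean(String domain, String title) {
        String result = title;
        List<String> rules = rulesByDomain.getOrDefault(
                domain.toLowerCase(Locale.ROOT), Collections.<String>emptyList());
        for (String text : rules) {
            result = result.replace(text, "");
        }
        return result.trim();
    }
}
```

With the example rule loaded, the title "Breaking News: KCAL05: This just in - some guy won a million bucks" on kcal9.com would be reduced to the part after the prefix, while titles from other domains pass through untouched.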
http://www.cnn.com/2012/11/07/politics/why-romney-lost/index.html returns http://i.cdn.turner.com/cnn/.e/img/3.0/mosaic/bttn_close.gif as its main image, probably because it's the first image inside the cnn_strycntntlft div.
This page has a good og:image tag, but that is overridden by the site-specific data.
When parsing some articles, some paragraphs can appear in the wrong order after the DocumentCleaner has gone through them.
The cause is the method convertDivsToParagraphs in src/main/scala/com/gravity/goose/cleaners/DocumentCleaner.scala.
When child elements of a div are parsed for text nodes, all the text nodes are appended to a single text string; the text nodes are later removed and a new paragraph node is added as the first child of the div. This can mess up the order if the div contains multiple other elements such as paragraphs or spans. The problem is that the text nodes are collected into a single paragraph that is always added as the first child. Some text nodes can be interwoven between other paragraph nodes, so text that originally appeared after a paragraph now appears in front of it.
You can see the effect when cleaning this page: http://danielspicar.github.com/goose-bug.html
Note how TextNode 1 and TextNode 2 are merged into one paragraph at the beginning of the div container. This has the effect that TextNode 2 now appears before Paragraph 1 (if you don't see TextNode 2, it's now part of the text in the first paragraph).
In the original page TextNode 2 followed after Paragraph 1.
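One way to keep the original order is to flush the collected text into its own paragraph each time an element is reached, instead of merging all text into one paragraph at the front. The Java sketch below shows the idea on a simplified model of a div's children. `DivChild` and `ParagraphWrapper` are hypothetical stand-ins, not Goose or jsoup types, and every element is treated as a block boundary, which a real fix would have to refine for inline tags.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a child of a div: either raw text or an element's HTML.
final class DivChild {
    final boolean isText;
    final String value;

    private DivChild(boolean isText, String value) {
        this.isText = isText;
        this.value = value;
    }

    static DivChild text(String s) { return new DivChild(true, s); }
    static DivChild element(String html) { return new DivChild(false, html); }
}

final class ParagraphWrapper {
    // Wraps each run of consecutive text nodes in its own <p>, in document order.
    static List<String> wrapTextRuns(List<DivChild> children) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (DivChild child : children) {
            if (child.isText) {
                run.append(child.value);
            } else {
                flush(run, out);      // emit text seen so far BEFORE the element
                out.add(child.value); // then the element itself, unchanged
            }
        }
        flush(run, out); // trailing text after the last element
        return out;
    }

    private static void flush(StringBuilder run, List<String> out) {
        String text = run.toString().trim();
        if (!text.isEmpty()) {
            out.add("<p>" + text + "</p>");
        }
        run.setLength(0);
    }
}
```

For the children TextNode 1, Paragraph 1, TextNode 2, this yields three paragraphs in that same order, so TextNode 2 stays after Paragraph 1.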
I see that nytimes.com is on your list of sites that still need unit testing.
I've successfully installed Goose and have run the unit tests without trouble.
Here are two issues I found while trying to extract text from nytimes.com:
INFO [main] (HtmlFetcher.java:203) - Initializing HttpClient
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
INFO [main] (DefaultRequestDirector.java:491) - I/O exception (java.net.SocketException) caught when processing request: Connection reset
INFO [main] (DefaultRequestDirector.java:498) - Retrying request
WARN [main] (HtmlFetcher.java:132) - Connection reset
INFO [main] (HtmlFetcher.java:159) - starting...
INFO [main] (HtmlFetcher.java:161) - HTMLRESULT is empty or null
I'm not a Java guy, but when I tried to do something similar in Python, I discovered that setting the User-agent header causes nytimes.com to respond with a 301 status and never reach a 200 status. If I comment out line 244 in HtmlFetcher.java in Goose so that the User-agent header is not set and run the code, it successfully gets a response with article text. But then I see the 2nd problem:
It would be a great feature to be able to find the next page and/or the single page version automatically.
After a while my log becomes full of these messages and it just hangs. This didn't appear to be an issue in the previous version:
INFO com.gravity.goose.images.ImageUtils - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection
....
at com.gravity.goose.images.ImageUtils$.fetchEntity(ImageUtils.scala:267)
at com.gravity.goose.images.ImageUtils$.storeImageToLocalFile(ImageUtils.scala:172)
at com.gravity.goose.images.UpgradedImageIExtractor.getLocallyStoredImage(UpgradedImageIExtractor.scala:465)
at com.gravity.goose.images.UpgradedImageIExtractor$$anonfun$com$gravity$goose$images$UpgradedImageIExtractor$$findImagesThatPassByteSizeTest$1.apply(UpgradedImageIExtractor.scala:348)
at com.gravity.goose.images.UpgradedImageIExtractor$$anonfun$com$gravity$goose$images$UpgradedImageIExtractor$$findImagesThatPassByteSizeTest$1.apply(UpgradedImageIExtractor.scala:341)
at scala.collection.Iterator$class.foreach(Iterator.scala:652)
However, I was previously calling ImageExtractor directly, and now I'm using com.gravity.goose.Goose.extractContent, which appears to be calling UpgradedImageExtractor ....
com.gravity.goose.images.UpgradedImageIExtractor.com$gravity$goose$images$UpgradedImageIExtractor$$findImagesThatPassByteSizeTest(UpgradedImageIExtractor.scala:341)
Using topNode, Goose throws away parts of the text that do not score well. Each paragraph is a separate div, and Goose removes those that are too short.
The image confidence score is purely based on the number of images retrieved, but has no relationship to the relative scores of the images. Confidence score should reflect the difference in the scores of the various images retrieved, as well as the absolute score of the image.
For example, if several images have a similar score, we are not confident which one is correct, but if 10 images have poor scores and one has a good score, we can be fairly confident.
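As a rough illustration of the proposal, the sketch below combines the relative gap between the best and second-best score with the absolute quality of the best score. The formula, the `referenceScore` parameter (the score regarded as "good", assumed to be positive), and the class name are all assumptions for illustration; Goose's actual image scoring is not shown here.

```java
// Hypothetical confidence measure; not Goose's implementation.
final class ImageConfidence {
    // Returns a value in [0, 1]: high only when the best score is both
    // clearly ahead of the runner-up and good in absolute terms.
    static double confidence(double[] scores, double referenceScore) {
        if (scores.length == 0) {
            return 0.0;
        }
        double best = Double.NEGATIVE_INFINITY;
        double second = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            if (s > best) {
                second = best;
                best = s;
            } else if (s > second) {
                second = s;
            }
        }
        if (best <= 0.0) {
            return 0.0; // no image scored positively
        }
        double runnerUp = Math.max(second, 0.0);               // a single image has no rival
        double margin = (best - runnerUp) / best;              // relative lead of the winner
        double absolute = Math.min(1.0, best / referenceScore); // absolute quality, capped at 1
        return margin * absolute;
    }
}
```

Under this sketch, three images with nearly equal scores produce a low confidence, while one good image among ten poor ones produces a high one, which matches the behavior asked for above.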
Hey,
It seems that the Akka repository has changed to this: http://repo.akka.io/releases/
I couldn't build with the repository you used in the pom file.
/Amir
Hello,
How can I use this in Java, since Android does not support Scala?
thanks.
I have a pipeline that uses Goose to download text from news articles. Sometimes a bad article stops the pipeline because Goose can't download images and raises an exception for every image every 2 minutes, as in the following log example.
May 7, 2012 3:12:13 AM com.gravity.goose.utils.Logging$class info
INFO: com.gravity.goose.network.ImageFetchException: com.gravity.goose.network.ImageFetchException ==> Failed to fetch image file from imgSrc: http://rss.feedsportal.com/flickr/foto/2011_01/5350399008_96bfb1d665_s.jpg
May 7, 2012 3:14:13 AM com.gravity.goose.utils.Logging$class info
INFO: com.gravity.goose.network.ImageFetchException: com.gravity.goose.network.ImageFetchException ==> Failed to fetch image file from imgSrc: http://rss.feedsportal.com/flickr/foto/2011_09/6165889887_57c08896b9_s.jpg
May 7, 2012 3:16:13 AM com.gravity.goose.utils.Logging$class info
INFO: com.gravity.goose.network.ImageFetchException: com.gravity.goose.network.ImageFetchException ==> Failed to fetch image file from imgSrc: http://rss.feedsportal.com/flickr/foto/2010_10/4935053263_ef10f461c5_s.jpg
I want to catch the exception and stop processing that article, but I can't catch the exception. Can someone help me?
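Since the failures above are only logged and apparently never reach the caller, one workaround is to bound the time spent on each article from the outside. The sketch below runs an arbitrary task with a timeout using only the JDK; `BoundedTask` is a made-up name, and the extraction call it would wrap is left to the caller.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper: gives up on a task that takes too long or fails.
final class BoundedTask {
    static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis) {
        // Daemon thread so a stuck task cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-task");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; see the note below
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the task itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The pipeline would then skip an article whenever the result is empty. Cancelling only interrupts the worker thread, so a download blocked in non-interruptible I/O may linger in the background until its own timeout expires.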
http://www.accountancyage.com/aa/analysis/2111729/institutes-ifrs-bang
DefaultDocumentCleaner.cleanBadTags detects "content_print" as a naughtyID and removes it. The actual news is under this div tag.
[id~=(" + regExRemoveNodes + ")] matches print in id="content_print" and removes it.
Would it be better to use an exact word-boundary match instead of a fuzzy match?
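The difference between the two matching styles can be seen with plain java.util.regex. The sketch assumes the attribute regex is applied with find-style (substring) matching, which is what the report describes, and uses "print" as a stand-in for one alternative of regExRemoveNodes; the class name is hypothetical.

```java
import java.util.regex.Pattern;

// Compares substring matching with a word-boundary variant for one keyword.
final class NaughtyIdMatcher {
    static final Pattern FUZZY = Pattern.compile("print");
    // \b treats letters, digits and '_' as word characters, so "content_print"
    // has no boundary before "print", while "print-footer" has one after it.
    static final Pattern BOUNDED = Pattern.compile("\\bprint\\b");

    static boolean fuzzyMatch(String id) {
        return FUZZY.matcher(id).find();
    }

    static boolean boundedMatch(String id) {
        return BOUNDED.matcher(id).find();
    }
}
```

With the bounded pattern, "content_print" is kept while "print" and "print-footer" are still matched. The trade-off is that ids such as "print_footer" would also stop matching, because the underscore counts as a word character.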
Re : https://github.com/jiminoc/goose/blob/master/src/main/java/com/jimplush/goose/network/HtmlFetcher.java#L135
Can you tell me the significance of the number 15728640 bytes or 15 MB ?
Is there any reason that the header Accept-Encoding (gzip or deflate) is not used ?
Are there any other considerations I should keep in mind while I go ahead with the upgrade ?
Thanks.
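On the number itself: 15728640 is exactly 15 × 1024 × 1024 bytes. Why that limit was chosen is not stated here; a cap of this kind is commonly used to bound memory when reading a response of unknown size. The sketch below shows such a capped read with a hypothetical `BoundedReader` class; it is not the code in HtmlFetcher.java.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical capped reader; illustrates the arithmetic and the idea of a size limit.
final class BoundedReader {
    // 15 * 1024 * 1024 = 15728640 bytes.
    static final int MAX_BYTES = 15 * 1024 * 1024;

    // Reads at most maxBytes from the stream and silently stops at the cap.
    static byte[] readAtMost(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int remaining = maxBytes;
        try {
            while (remaining > 0) {
                int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (n < 0) {
                    break; // end of stream before the cap was reached
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```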
Currently Goose defaults to a Mozilla user agent; we need the ability to provide alternative user agents.
*note working on this today.
Would help a lot.
Several problems with this class:
Line 192 catches an exception from httpget.abort() and swallows it
Several exceptions after line 152 log at trace level, not even info or warn
These issues are making it difficult to diagnose problems with my library.
I tested this jar for goose:
https://www.dropbox.com/s/h0tu7bhl834ylnz/goose.jar from this thread:
https://github.com/jiminoc/goose/issues/59 by qnex.
It works in a normal java project but in Android I get this error:
10-05 12:45:05.858: E/AndroidRuntime(1825): FATAL EXCEPTION: main
10-05 12:45:05.858: E/AndroidRuntime(1825): java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
10-05 12:45:05.858: E/AndroidRuntime(1825): at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:638)
10-05 12:45:05.858: E/AndroidRuntime(1825): at dalvik.system.NativeStart.main(Native Method)
10-05 12:45:05.858: E/AndroidRuntime(1825): Caused by: java.lang.reflect.InvocationTargetException
10-05 12:45:05.858: E/AndroidRuntime(1825): at java.lang.reflect.Method.invokeNative(Native Method)
10-05 12:45:05.858: E/AndroidRuntime(1825): at java.lang.reflect.Method.invoke(Method.java:507)
10-05 12:45:05.858: E/AndroidRuntime(1825): at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:880)
10-05 12:45:05.858: E/AndroidRuntime(1825): ... 2 more
10-05 12:45:05.858: E/AndroidRuntime(1825): Caused by: java.lang.Exception: /tmp/goose directory does not seem to exist, you need to set this for image processing downloads
10-05 12:45:05.858: E/AndroidRuntime(1825): at com.gravity.goose.Goose.initializeEnvironment(Goose.scala:68)
10-05 12:45:05.858: E/AndroidRuntime(1825): at com.gravity.goose.Goose.<init>(Goose.scala:31)
10-05 12:45:05.858: E/AndroidRuntime(1825): at com.example.goose.MainActivity$1.onClick(MainActivity.java:27)
10-05 12:45:05.858: E/AndroidRuntime(1825): at android.view.View.performClick(View.java:2485)
10-05 12:45:05.858: E/AndroidRuntime(1825): at android.view.View$PerformClick.run(View.java:9081)
10-05 12:45:05.858: E/AndroidRuntime(1825): at android.os.Handler.handleCallback(Handler.java:587)
10-05 12:45:05.858: E/AndroidRuntime(1825): at android.os.Handler.dispatchMessage(Handler.java:92)
10-05 12:45:05.858: E/AndroidRuntime(1825): at android.os.Looper.loop(Looper.java:130)
10-05 12:45:05.858: E/AndroidRuntime(1825): at android.app.ActivityThread.main(ActivityThread.java:3770)
10-05 12:45:05.858: E/AndroidRuntime(1825): ... 5 more
Can someone help me with that? :)
Value exists in Configuration, but also exists in UpgradedImageIExtractor, which references its local version
It appears that goose.extractContent(url) is returning a valid com.gravity.goose.Article object, however it appears that it is empty (at least article.cleanedArticleText returns nothing). The assert that fails expects that extractContent(url) will return null rather than an object.
Fails to get the list items near the start of:
Specifying Goose as a Maven dependency in a project fails to fetch the artifacts. Please upload the artifacts to Maven Central, or publish the Maven repository URL in the README if it is something other than Maven Central.
I tried this page:
And got this result (the numbers between square brackets show the original order of the phrases) :
[5] Participants were required to sign a legal disclaimer prior to taking part in the competition, and two members of the British Red Cross were on hand, but they could not cope with the nature of the injuries sustained.
[4] Paramedics attended the event on Saturday - the busiest day of the week for the ambulance service - costing the service several hundred pounds.
[3] Today, the Scottish Ambulance Service said it wanted the restaurant to review the way the event was managed.
[2] One participant, Curie Kim was so ill after sampling the "Kismot Killer" that she had to be taken by ambulance to the Edinburgh Royal Infirmary twice in a matter of hours.
[1] Emergency services were called to Kismot Restaurant's curry-eating challenge, on St Leonards Place, Edinburgh, after competitors started writhing on the floor in agony, vomiting and fainting during the contest.
[6] Curry house owner Abdul Ali admitted that he would have to "tone down" the contest, but said the challenge had raised hundreds of pounds for charity CHAS.
[7] He added that half of the 20 people who took part in the challenge had dropped out after witnessing the first 10 diners vomiting, collapsing, sweating and panting.
[8] Previously the restaurant's Kismot Killer dish has caused diners to suffer nose bleeds and one elderly man had to go to hospital.
...
Love this as a java option to readability and am currently using it in my android app. One thing I have noticed though is for every article on http://host.ap.org it chooses this image (http://hosted.ap.org/specials/images/ap_photo_promo.jpg) as the main image.
At a glance it seems that Goose is ignoring the network proxy configuration set via System properties (http.proxyHost and http.proxyPort).
Is there a way to configure that? If so, can someone please point me to the documentation?
Thanks
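For reference, http.proxyHost and http.proxyPort are the standard JDK networking properties, and an HTTP client that does not consult them has to be given the proxy explicitly. The sketch below only shows how the two properties can be read into a java.net.Proxy; the `SystemProxy` class is hypothetical, and how the result would be handed to Goose's HTTP client is not shown, since no such hook is described here.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

// Hypothetical helper that turns the JDK proxy properties into a Proxy object.
final class SystemProxy {
    static Proxy fromSystemProperties() {
        String host = System.getProperty("http.proxyHost");
        if (host == null || host.isEmpty()) {
            return Proxy.NO_PROXY; // no proxy configured
        }
        // The JDK documents 80 as the default when http.proxyPort is not set.
        int port = Integer.parseInt(System.getProperty("http.proxyPort", "80"));
        // Unresolved address: DNS lookup is deferred to whoever uses the proxy.
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }
}
```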
Well not sure whether it is an issue or not.
I am using goose to extract the text but I also needed it to return the redirected final url. So I changed the htmlFetcher static method to do it. My question would be, does it make goose slow? I have not done any performance testing with it. Wanted to ask you guys before using the changes I have made.
I would be happy to give a patch if you guys think it would be a value add.
Another change I made was to image fetching. You guys had the code download the images and then search for image tags as a backup. It was slow, and since I needed the image mainly for aesthetic purposes, I changed the code and added an extra boolean parameter to decide whether to use the downloading code or just use image tag parsing.
Thanks,
Sharath
Hi,
I modified TalkToMeGoose.scala to output more fields and plan to use it from the command line. However, when I do:
println(article.topImage)
I get
com.gravity.goose.images.Image@36db492
Which is not a file path, url or binary data. I looked in /tmp but couldn't find the image file in there.
Can you please show me how to get the image file?
Thanks
Why is akka listed in the Maven pom dependencies? I can run mvn test successfully without it in the pom.
Hi,
I just wondered if you have seen the Dispatch library, and whether you would consider it for your HTTP access instead of Apache HttpClient.
Best,
Amir
I've tried to use the Goose API to parse Chinese, Japanese, and Korean webpages, and none of them work. Is there any plan to support this?
It would be nice to have a command-line runnable version of Goose for doing extractions on the fly, with output that can be piped to other Unix commands, such as
goose http://site.com/article1 > article1.txt
suggested by John (https://github.com/b79)
http://allthingsd.com/20111123/nokia-siemens-to-cut-17000-jobs/
http://allthingsd.com/20111123/tripadvisor-makes-itself-available-offline-to-help-travelers-avoid-roaming-charges/
The actual content tag scores lower than an irrelevant tag, and link density is also higher in the original content.
I noticed the code snippets are omitted from the following articles:
http://hacks.mozilla.org/2010/01/offline-web-applications/
http://www.nczonline.net/blog/2009/01/13/speed-up-your-javascript-part-1/
Hi,
I'm trying to test Goose with 11M URLs and I always get an out-of-memory error caused by jsoup after a few pages. In the stack I can always see that jsoup accumulates a lot of data. Is there any way to clean the extractor after a page has been processed?
Hello,
What is the best way to put Goose in a web service running on PHP?
Thanks,
Rui Gaspar
Calling:
Goose = new GooseContentParser( new Configuration());
contentParser.parseContentUsingGoose('http://www.tulsaworld.com/site/articlepath.aspx?articleid=20111118_61_A16_Opposi344152&rss_lnk=7');
Causes a NullPointerException. I reported the link on the demo page as it fails to load this article as well.
I've noticed that on recipe sites, like foodtv.com, the ingredients section gets cut off.