
crawler4j's People

Contributors

chaiavi, yasserg

crawler4j's Issues

Exception in thread "main" : com.sleepycat.je.EnvironmentLockedException

I was just trying to set up crawler4j with the (simple) code given on the Google 
Code page, but when I run it (using the Eclipse IDE) it throws the exception 
below. Can you help me point out what is going wrong?


Deleting content of: E:\crawldata\frontier
Exception in thread "main" com.sleepycat.je.EnvironmentLockedException: (JE 
4.0.71) E:\crawldata\frontier The environment cannot be locked for single 
writer access. ENV_LOCKED: The je.lck file could not be locked. Environment is 
invalid and must be closed.
    at com.sleepycat.je.log.FileManager.<init>(FileManager.java:346)
    at com.sleepycat.je.dbi.EnvironmentImpl.<init>(EnvironmentImpl.java:435)
    at com.sleepycat.je.dbi.EnvironmentImpl.<init>(EnvironmentImpl.java:337)
    at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:182)
    at com.sleepycat.je.Environment.makeEnvironmentImpl(Environment.java:230)
    at com.sleepycat.je.Environment.<init>(Environment.java:212)
    at com.sleepycat.je.Environment.<init>(Environment.java:166)
    at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:56)
    at project.sentiment.Controller.main(Controller.java:22)


****************************
What steps will reproduce the problem?

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?
- crawler4j-2.2 on Win XP

Please provide any additional information below.
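A note for anyone hitting the same error: je.lck usually means another process 
(or a previous run that was not shut down cleanly) still holds the Berkeley DB 
environment in E:\crawldata\frontier. A minimal sketch of one workaround, 
assuming it is acceptable to give every run its own storage folder (the folder 
name is only an example, and MyCrawler stands for your WebCrawler subclass):

import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        // Give every run its own crawl storage folder so a stale je.lck
        // lock file from an earlier run cannot block this one.
        String storageFolder = "E:/crawldata/run-" + System.currentTimeMillis();
        CrawlController controller = new CrawlController(storageFolder);
        controller.addSeed("http://www.example.com/");
        controller.start(MyCrawler.class, 1);
    }
}

Alternatively, make sure only one crawler instance uses E:\crawldata\frontier at 
a time and that the previous JVM has actually exited.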


Original issue reported on code.google.com by [email protected] on 5 Oct 2010 at 8:43

Exception While Crawling !!

What steps will reproduce the problem?
1. I am using a list of urls to start many concurrent web crawlers using your 
libraries, I was very keen on synchronization issues and so on.
2. Run the crawler on some machines get me a strange exception as follows:
****************************************
Exception in thread "Crawler 8" java.lang.IllegalStateException: Can't open a 
cursor Database was closed.
    at com.sleepycat.je.Database.checkOpen(Database.java:1745)
    at com.sleepycat.je.Database.openCursor(Database.java:619)
    at edu.uci.ics.crawler4j.frontier.WorkQueues.get(WorkQueues.java:50)
    at edu.uci.ics.crawler4j.frontier.Frontier.getNextURLs(Frontier.java:74)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:72)
    at java.lang.Thread.run(Thread.java:636)
java.lang.IllegalStateException: Can't call Database.put Database was closed.
    at com.sleepycat.je.Database.checkOpen(Database.java:1745)
    at com.sleepycat.je.Database.put(Database.java:1046)
    at edu.uci.ics.crawler4j.frontier.WorkQueues.put(WorkQueues.java:100)
    at edu.uci.ics.crawler4j.frontier.Frontier.scheduleAll(Frontier.java:48)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.preProcessPage(WebCrawler.java:142)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:85)
    at java.lang.Thread.run(Thread.java:636)
java.lang.IllegalStateException: Can't call Database.put Database was closed.
    at com.sleepycat.je.Database.checkOpen(Database.java:1745)
    at com.sleepycat.je.Database.put(Database.java:1046)
    at edu.uci.ics.crawler4j.frontier.WorkQueues.put(WorkQueues.java:100)
    at edu.uci.ics.crawler4j.frontier.Frontier.scheduleAll(Frontier.java:48)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.preProcessPage(WebCrawler.java:142)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:85)
    at java.lang.Thread.run(Thread.java:636)
java.lang.IllegalStateException: Can't call Database.put Database was closed.
    at com.sleepycat.je.Database.checkOpen(Database.java:1745)
    at com.sleepycat.je.Database.put(Database.java:1046)
    at edu.uci.ics.crawler4j.frontier.WorkQueues.put(WorkQueues.java:100)
    at edu.uci.ics.crawler4j.frontier.Frontier.scheduleAll(Frontier.java:48)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.preProcessPage(WebCrawler.java:142)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:85)
    at java.lang.Thread.run(Thread.java:636)
java.lang.IllegalStateException: Can't call Database.put Database was closed.
    at com.sleepycat.je.Database.checkOpen(Database.java:1745)
    at com.sleepycat.je.Database.put(Database.java:1046)
    at edu.uci.ics.crawler4j.frontier.WorkQueues.put(WorkQueues.java:100)
    at edu.uci.ics.crawler4j.frontier.Frontier.scheduleAll(Frontier.java:48)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.preProcessPage(WebCrawler.java:142)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:85)
    at java.lang.Thread.run(Thread.java:636)
java.lang.IllegalStateException: Can't call Database.put Database was closed.
    at com.sleepycat.je.Database.checkOpen(Database.java:1745)
    at com.sleepycat.je.Database.put(Database.java:1046)
    at edu.uci.ics.crawler4j.frontier.WorkQueues.put(WorkQueues.java:100)
    at edu.uci.ics.crawler4j.frontier.Frontier.scheduleAll(Frontier.java:48)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.preProcessPage(WebCrawler.java:142)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:85)
    at java.lang.Thread.run(Thread.java:636)
****************************************

What is the expected output? What do you see instead?
- It is expected to crawl silently, without any exceptions.

What version of the product are you using? On what operating system?
- I tried the latest version, crawler4j-2.2, on both Linux/Ubuntu 10.04.1 and 
Windows XP SP3.

Please provide any additional information below.
- I am using wget to download the pages completely into a website-like directory 
structure.
- Sometimes, when I use Windows XP SP3 on the same machine that also has Ubuntu 
on it, it works fine without exceptions.
- I am packaging the whole code into an executable jar and running it on the 
command line.
- When running the executable jar I raise the minimum and maximum heap sizes to 
512m and 1024m respectively.
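
For what it's worth, one likely cause of the "Database was closed" errors is 
several concurrent crawls sharing one CrawlController storage folder, so that 
shutting one crawl down closes the Berkeley DB environment the other crawler 
threads are still using. A rough sketch of isolating them, assuming MyCrawler is 
the WebCrawler subclass and the folder names are only examples:

import java.util.List;
import edu.uci.ics.crawler4j.crawler.CrawlController;

public class ConcurrentCrawls {
    // startUrls is the list of seed URLs mentioned above.
    public static void launch(List<String> startUrls, final int numberOfCrawlers) {
        for (int i = 0; i < startUrls.size(); i++) {
            final String seed = startUrls.get(i);
            final String folder = "/tmp/crawl-" + i;   // one storage folder per crawl
            new Thread(new Runnable() {
                public void run() {
                    try {
                        // Each crawl gets its own controller and its own Berkeley DB
                        // environment, so closing one cannot invalidate another.
                        CrawlController controller = new CrawlController(folder);
                        controller.addSeed(seed);
                        controller.start(MyCrawler.class, numberOfCrawlers);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}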

Thanks

Original issue reported on code.google.com by hafez.khaled on 4 Oct 2010 at 10:12

Failure on Non-UTF-8 pages

What steps will reproduce the problem?
1. Add a seed with non-UTF-8 content
2. Start the crawl


What is the expected output? What do you see instead?
Expected: correct output / loading of ISO-8859-1 pages.
Instead: wrong characters.

What version of the product are you using? On what operating system?
1.8.1, Windows XP

Please provide any additional information below.
The crawler seems to work only with UTF-8 pages.
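
Until a proper fix lands, a sketch of decoding the raw page bytes with the 
charset announced in the Content-Type header instead of assuming UTF-8 
(contentBytes and contentType are placeholders for whatever the fetcher 
returned, not existing crawler4j fields):

import java.io.UnsupportedEncodingException;

// contentType is e.g. "text/html; charset=ISO-8859-1", contentBytes the raw body.
static String decode(byte[] contentBytes, String contentType)
        throws UnsupportedEncodingException {
    String charset = "UTF-8";                       // fall back to UTF-8
    if (contentType != null) {
        int idx = contentType.toLowerCase().indexOf("charset=");
        if (idx >= 0) {
            charset = contentType.substring(idx + "charset=".length()).trim();
        }
    }
    return new String(contentBytes, charset);
}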

Original issue reported on code.google.com by [email protected] on 17 Apr 2010 at 4:30

How to run the crawl process

Hi,

I am new to crawling. I want to know how to start the crawl and how to see the 
crawled results, from a layman's perspective.
When I simply run the file Controller.java, we are supposed to enter the root 
and the number of times we need to crawl, but I don't know where the control 
goes after that.
It would be great if anyone could help me understand the actual flow of the 
crawling in detail.
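
For reference, the basic flow (pieced together from the other reports on this 
page; the folder, seed URL and thread count are only examples, and MyCrawler is 
your class extending WebCrawler) looks roughly like this:

import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        // Folder where crawler4j keeps its intermediate data (frontier, doc IDs).
        CrawlController controller = new CrawlController("/data/crawl/root");
        // The page(s) the crawl starts from.
        controller.addSeed("http://www.example.com/");
        // Start 10 crawler threads; each one calls your crawler's shouldVisit()
        // to decide which links to follow and visit() for every downloaded page.
        controller.start(MyCrawler.class, 10);
    }
}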

Original issue reported on code.google.com by [email protected] on 22 Jun 2010 at 11:21

Resume Crawl - Enhancement

What steps will reproduce the problem?
1. Start a crawl that is going to take 8 weeks
2. The crawl terminates (unexpectedly) after 6 weeks
3. Start a new crawl - 8 more weeks (maybe)!

What is the expected output? What do you see instead?
Would expect it to open up the databases, continue through the work queues, and 
not schedule any visits if the docID already exists. Currently I must start over 
and revisit all pages.

What version of the product are you using? On what operating system?
c4j 2.2 / Mac OS X Server 10.6.4

Please provide any additional information below.
I'm trying to implement this but don't have any experience with BerkeleyDB. 
I've eliminated the home-folder deletion and local data purge from the 
constructor and start methods of the controller. I've also turned off 
deferredWrite for the DocIDsDB, but it's reporting a count of 0 after 
terminating and re-running the program.

Is this a trivial change to implement? If so can you point me in the right 
direction? If not, can you help me understand what are the challenges? I'm 
going to keep trying but if you have any pointers that'd be appreciated.
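
Not a definitive answer, but one thing worth checking: Berkeley DB JE only makes 
buffered records durable once the database/environment is flushed, so a count of 
0 after an unclean exit may just mean nothing was ever synced. A sketch of an 
exit-time (or periodic) flush, where docIdEnv and docIDsDB stand for the 
Environment and Database handles used by the frontier code (not existing 
accessor names):

import com.sleepycat.je.Database;
import com.sleepycat.je.Environment;

// Flush buffered writes so a later run can reopen the databases and resume.
static void flushForResume(Environment docIdEnv, Database docIDsDB) {
    if (docIDsDB.getConfig().getDeferredWrite()) {
        docIDsDB.sync();   // deferred-write databases must be synced explicitly
    }
    docIdEnv.sync();       // flush and fsync the environment's log
}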

Thanks!

Original issue reported on code.google.com by [email protected] on 7 Oct 2010 at 5:49

All the seeds get crawled or visited before any further depth is crawled

What steps will reproduce the problem?

Add a good number of seeds, say 100,000.

What is the expected output? What do you see instead?

Crawler4j waits to finish all of the seeds before it even starts going deeper 
into any single seed. I was expecting it to crawl the deeper levels in parallel 
as well.

What version of the product are you using? On what operating system?
2.6
Windows 7

Original issue reported on code.google.com by [email protected] on 2 May 2011 at 11:19

How do we dynamically set the websites that the crawler should visit

We start with a seed site and then we want to restrict the crawler to only 
follow links that are specified dynamically. We'd like to be able to pass 
that list of links to WebCrawler via a method such as setAllowedHostnames() 
since we don't want to hard code the list of sites in shouldVisit(). What's 
the best way to accomplish this?
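
Until something like setAllowedHostnames() exists, one workaround (a sketch, not 
an official API; adjust the WebURL import to the package used by your crawler4j 
version) is a shared, thread-safe set on the crawler class that the launching 
code fills before calling start():

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Filled by the launching code before controller.start(); every crawler
    // thread reads it from shouldVisit(), hence the synchronized set.
    public static final Set<String> allowedHostnames =
            Collections.synchronizedSet(new HashSet<String>());

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        synchronized (allowedHostnames) {
            for (String host : allowedHostnames) {
                if (href.startsWith("http://" + host)) {
                    return true;
                }
            }
        }
        return false;
    }
}

The caller does MyCrawler.allowedHostnames.addAll(dynamicList) and then starts 
the controller as usual.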


Original issue reported on code.google.com by [email protected] on 6 May 2010 at 1:37

Exception after replacing v1.7

What steps will reproduce the problem?
1. Create a singleton class with a private constructor (ConsoleApp)
2. Create a class extending WebCrawler with the methods main, shouldVisit,
visit, getMyLocalData, onBeforeExit, dumpMyData (all without code)
3. Call this in the singleton:
CrawlController controller = new CrawlController("myfolder");
controller.addSeed("http://www.test.de");
controller.start(MyWebCrawler.class, 1);

What is the expected output? What do you see instead?
Expected: Normal logging entries
Instead:
java.lang.NullPointerException
    at java.util.Properties$LineReader.readLine(Unknown Source)
    at java.util.Properties.load0(Unknown Source)
    at java.util.Properties.load(Unknown Source)
    at edu.uci.ics.crawler4j.crawler.Configurations.<clinit>(Configurations.java:33)
    at edu.uci.ics.crawler4j.crawler.HTMLParser.<clinit>(HTMLParser.java:31)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.<init>(WebCrawler.java:41)
    at main.java.MyCrawler.<init>(MyCrawler.java:15)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at java.lang.Class.newInstance0(Unknown Source)
    at java.lang.Class.newInstance(Unknown Source)
    at edu.uci.ics.crawler4j.crawler.CrawlController.start(CrawlController.java:68)
    at main.java.ConsoleApp.start(ConsoleApp.java:226)
    at main.java.Crawler.main(Crawler.java:10)
Exception in thread "main" java.lang.ExceptionInInitializerError
    at edu.uci.ics.crawler4j.crawler.WebCrawler.<init>(WebCrawler.java:41)
    at main.java.MyCrawler.<init>(MyCrawler.java:15)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at java.lang.Class.newInstance0(Unknown Source)
    at java.lang.Class.newInstance(Unknown Source)
    at edu.uci.ics.crawler4j.crawler.CrawlController.start(CrawlController.java:68)
    at main.java.ConsoleApp.start(ConsoleApp.java:226)
    at main.java.Crawler.main(Crawler.java:10)
Caused by: java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Unknown Source)
    at java.lang.Integer.parseInt(Unknown Source)
    at edu.uci.ics.crawler4j.crawler.Configurations.getIntProperty(Configurations.java:20)
    at edu.uci.ics.crawler4j.crawler.HTMLParser.<clinit>(HTMLParser.java:31)
    ... 11 more

What version of the product are you using? On what operating system?
Windows XP, 1.8 / 1.8.1

Please provide any additional information below.
I just replaced Version 1.7 with 1.8 / 1.8.1 and the error comes up. Before
1.8 everything worked fine.
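
For what it's worth, a NullPointerException inside Properties.load usually means 
Configurations got a null stream, i.e. it could not find its properties file on 
the classpath (1.8 reads its settings from a file such as crawler4j.properties, 
mentioned elsewhere on this page). A quick check along these lines makes the 
failure obvious:

// Verify that the configuration file the 1.8.x Configurations class expects
// is actually visible on the classpath before starting the controller.
java.io.InputStream in = Thread.currentThread().getContextClassLoader()
        .getResourceAsStream("crawler4j.properties");
if (in == null) {
    System.err.println("crawler4j.properties is not on the classpath; "
            + "copy it into your project's resources or next to the jar.");
}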

Original issue reported on code.google.com by [email protected] on 14 Apr 2010 at 7:30

'Depth' parameter doesn't work. Also can't "resume" crawling. Seems that sources don't match binary classes in version 2.6.1

What steps will reproduce the problem?
1. Try to check 'depth' property of WebURL inside WebCrawler.shouldVisit() 
method, it's always equal to 0.
2. I tried to use the constructor CrawlController(String, boolean), but it is 
undefined. According to the source code at 
http://crawler4j.googlecode.com/svn/trunk/crawler4j/src/edu/uci/ics/crawler4j/crawler/CrawlController.java 
it should be present.

What is the expected output? What do you see instead?
I expected the crawler.max_depth parameter to work, but it doesn't.
Also, every time I restart the crawler the storage folder is cleared. I tried to 
use CrawlController(String, boolean) explicitly, but in my jar it is absent.

What version of the product are you using? On what operating system?
version - 2.6.1.
OS - windows 7

Please provide any additional information below.
It seems that the crawler4j-2.6.1 jar was built from out-of-date sources.
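
For context, the check being attempted looks roughly like this (assuming the 
getter on WebURL is getDepth(), as the field name suggests); with the 2.6.1 jar 
it always sees 0:

@Override
public boolean shouldVisit(WebURL url) {
    int depth = url.getDepth();   // always 0 with the released 2.6.1 jar
    return depth <= 3 && url.getURL().startsWith("http://www.example.com/");
}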

Original issue reported on code.google.com by [email protected] on 13 Apr 2011 at 10:44

Notification of crawl finish

It would be nice if our Java program could be notified that the crawl has
completed.
I am thinking that we could implement a method such as onCrawlFinish in the
controller, so folks can subclass the Controller and provide this method,
or we could create a CrawlEventListener that can be passed into the
controller and called when the crawl has completed.
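
A sketch of what the listener variant could look like (hypothetical names; 
nothing like this exists in crawler4j yet):

// Hypothetical callback interface, as proposed above.
public interface CrawlEventListener {
    void onCrawlFinished();
}

// The controller would keep a list of registered listeners and, after all
// crawler threads have ended, call onCrawlFinished() on each of them.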




Original issue reported on code.google.com by [email protected] on 29 Mar 2010 at 4:44

download mp3 file as NON binary file

An mp3 file is downloaded as a text file; please help fix it.

I think it is because of this code in edu.uci.ics.crawler4j.crawler.PageFetcher, 
in the method public static int fetch(Page page, boolean ignoreIfBinary):

Header type = entity.getContentType();
                if (type != null && type.getValue().toLowerCase().contains("image")) {
                    isBinary = true;
                    if (ignoreIfBinary) {
                        return PageFetchStatus.PageIsBinary;
                    }
                }
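
A sketch of the kind of change that would treat mp3 (and other non-text) 
responses as binary too, keeping the structure of the snippet above; only the 
content-type test changes:

Header type = entity.getContentType();
if (type != null) {
    String typeValue = type.getValue().toLowerCase();
    // Treat audio (e.g. mp3), video and generic octet streams as binary,
    // not just "image" as in the current code.
    if (typeValue.contains("image") || typeValue.contains("audio")
            || typeValue.contains("video") || typeValue.contains("octet-stream")) {
        isBinary = true;
        if (ignoreIfBinary) {
            return PageFetchStatus.PageIsBinary;
        }
    }
}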

Original issue reported on code.google.com by [email protected] on 18 Mar 2011 at 10:12

Graceful stop/abort - good to have

Hi,
Since the Tomcat instance is shared, it cannot be stopped just to stop a running 
crawler. It would be good to have the ability to trigger a stop/abort of the 
running crawler, which would then stop once all the running threads are done.

What steps will reproduce the problem?
-NA

What is the expected output? What do you see instead?
-NA

What version of the product are you using? On what operating system?
-1.8

Please provide any additional information below.
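Until such an API exists, a crude workaround is a shared flag that the web 
application flips and every crawler thread checks, so the crawl drains its 
queues instead of taking new work (a sketch, not an existing crawler4j feature; 
adjust the imports to your version):

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class StoppableCrawler extends WebCrawler {

    // Flipped by a servlet/admin action; crawler threads stop scheduling new URLs.
    public static volatile boolean stopRequested = false;

    @Override
    public boolean shouldVisit(WebURL url) {
        return !stopRequested && url.getURL().startsWith("http://www.example.com/");
    }
}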

Thanks.

Original issue reported on code.google.com by [email protected] on 6 Aug 2010 at 7:15

Crawler doesn't follow relative links correctly

What steps will reproduce the problem?
1. Use the sample from the project page
2. Create a page http://domain.com/foo/bar/
3. Have a <a href="test.php">test</a> link on /foo/bar/
4. Crawl http://domain.com/foo/

Expected: the crawler crawls http://domain.com/foo/bar/test.php

Instead: the crawler follows http://domain.com/foo/test.php, which is a 404
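
For comparison, java.net.URI resolves the link against the page it appears on, 
which gives the expected URL (a sketch, not crawler4j code):

import java.net.URI;

public class ResolveExample {
    public static void main(String[] args) {
        URI pageUrl = URI.create("http://domain.com/foo/bar/");
        // Resolving against the page the link was found on gives the expected URL.
        System.out.println(pageUrl.resolve("test.php"));   // http://domain.com/foo/bar/test.php
    }
}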


Original issue reported on code.google.com by [email protected] on 17 Oct 2010 at 11:40

Add build.xml

Hi, 
I found your crawler very neat, simple and easy to use; just what I needed. 
However, I noticed that it wasn't handling encoding correctly. Then I saw that 
in the SVN code you have already corrected that issue. I downloaded the code and 
built a new JAR myself. Maybe you could add a build.xml to the project so anyone 
can download the project and create a new build easily; you could also add jar 
tasks and so on. Here is what I have created for building the project 
(attached).

Thanks and good work.

Original issue reported on code.google.com by [email protected] on 22 Oct 2010 at 10:14

Attachments:

What is the maximum number of seeds that can be given?

What steps will reproduce the problem?
I have 50 million seeds. I am trying to add all of them to the controller and 
want to run multiple threads. Will this work? Assume that I have enough memory 
and CPU power.

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?
2.6


Please provide any additional information below.
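No answer on the hard limit, but with that many seeds it at least helps not to 
hold them all in application memory; streaming them from a file into addSeed is 
enough, since the scheduled URLs end up in the on-disk frontier anyway (a 
sketch; the path is an example):

import java.io.BufferedReader;
import java.io.FileReader;
import edu.uci.ics.crawler4j.crawler.CrawlController;

static void addSeedsFromFile(CrawlController controller, String path) throws Exception {
    BufferedReader reader = new BufferedReader(new FileReader(path));
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.length() > 0) {
                controller.addSeed(line);   // one seed URL per line
            }
        }
    } finally {
        reader.close();
    }
}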


Original issue reported on code.google.com by [email protected] on 29 Apr 2011 at 7:16

Image func. broken in 2.2 - what was 2.1's bug?

What steps will reproduce the problem?
1. Try to use the Image crawling example (which is in the trunk...)
2. isBinary and the binary data download both fail

What is the expected output? What do you see instead?
The same output as in 2.1

What version of the product are you using? On what operating system?
2.2 on MacOSx, downloaded as the 2.2 zipped file from the google project page

Please provide any additional information below.
You mentioned that 2.2 fixes a minor bug. Would you let us know which bug this 
is? I can't find the source code of 2.1 to compare, but I personally have to use 
the 2.1 library, since image crawling does not work in 2.2 (the respective 
functions have been removed). Is it important that I use 2.2?

Also, do you have plans to integrate more functionality, e.g. charset 
detection, robots handling, or fixes for the bugs that have been submitted?

Of course, congratulations for your work and time spent!

Original issue reported on code.google.com by [email protected] on 7 Jan 2011 at 4:16

How to force crawler4j to stay within initial domain

I'm hoping that someone can help me with this:

I am trying to force the crawler to only crawl data within a domain.

e.g. I will be passing several domains to the crawlers (as individual 
executions) and would like each of them to start from a domain and follow only 
the links on that domain's pages that point to other pages in the same domain.

I haven't managed to pass the domain name of the first docId (the first page 
being crawled, i.e. the starting domain) to the shouldVisit function so that 
any other links are discarded.

Can anyone help with this?

I also tried adding

                 //               if (shouldVisit(cur) && canonicalUrl.startsWith(domainName)) {
                 //                   cur.setParentDocid(docid);
                 //                   toSchedule.add(cur);
                 //               }

within preProcessPage (inside WebCrawler.java), but I can't pass the 
domainName to this function, as it is executed by run(), and the thread won't 
take arguments.
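
One way around the "the thread won't take arguments" problem is to put the 
allowed domain somewhere every crawler instance can read it, e.g. a static field 
set before controller.start(), and to do the check in shouldVisit() rather than 
in preProcessPage() (a sketch, not an official API; adjust the WebURL import to 
your version):

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class DomainBoundCrawler extends WebCrawler {

    // Set once by the launching code from the seed URL, e.g. "http://www.example.com/".
    public static volatile String allowedDomainPrefix;

    @Override
    public boolean shouldVisit(WebURL url) {
        // Discard everything that does not live under the starting domain.
        return url.getURL().toLowerCase().startsWith(allowedDomainPrefix);
    }
}

Each of the individual executions sets DomainBoundCrawler.allowedDomainPrefix to 
its own domain before calling controller.start(DomainBoundCrawler.class, n).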


Original issue reported on code.google.com by [email protected] on 10 Jan 2011 at 1:27

Multithreading and protection against duplicates

What steps will reproduce the problem?

1. I extended the program with new functionality: searching for certain 
domains. I also added protection against duplicates.

2. THE PROBLEM: the protection against duplicates only works when I run one 
thread. When I run several threads, the program replicates the array of 
variables against which the domain is compared, and as a result the program 
keeps creating duplicates.

3. QUESTION: how should thread synchronization be used in this case, and does 
it apply at all? Maybe there is another way?

What is the expected output? What do you see instead? -> The program repeats 
the same domain in the output results (it creates duplicates).


What version of the product are you using? On what operating system? -> The 
latest version of crawler4j. Windows XP.


Please provide any additional information below.

The full description of the problem:
I extended the program with new functionality: searching for certain domains, 
with protection against duplicates.
But I have a problem: the protection against duplicates only works when I run 
one thread. When I run several threads, the program replicates the array of 
variables against which the domain is compared, and as a result it keeps 
creating duplicates.
I know that in this case thread synchronization would be necessary.
The domain-search functions (using regular expressions) and the duplicate 
prevention are in the MyCrawler class, but thread synchronization does not work 
in that part of the code.
How should thread synchronization be used in this case, and does it apply at 
all? Maybe there is another way?
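
The usual fix is to keep the already-seen domains in one collection shared by 
all crawler threads and make the check-and-add step atomic, instead of each 
thread working on its own copy of the array (a sketch):

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SeenDomains {

    // One static set for the whole JVM, so every crawler thread sees the same data.
    private static final Set<String> seen =
            Collections.synchronizedSet(new HashSet<String>());

    // Returns true only for the first thread that reports a given domain,
    // so duplicates never reach the output no matter how many threads run.
    public static boolean firstTimeSeen(String domain) {
        return seen.add(domain.toLowerCase());
    }
}

In MyCrawler, record a domain only when SeenDomains.firstTimeSeen(domain) 
returns true; the add() call itself is the synchronization point, so no extra 
locking is needed in the crawler.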

Sorry for my English.


Original issue reported on code.google.com by [email protected] on 19 May 2011 at 4:02

Erroneous link URL extraction from HTML

Hi,
I found that the parser extracts links from an HTML page that are not correct: 
it omits some portion of the path, which results in a 404. It especially happens 
when no file is given in the URL (the default file is being served), or when 
crawling a nested folder structure (served by a web server) that ultimately 
leads to files.

What steps will reproduce the problem?
1. Try to crawl URLs that do not indicate a file
2. Try to crawl a web-server-served nested (up to 3-4 levels) folder structure

What is the expected output? What do you see instead?
- URLs miss the last part of the path, resulting in 404s.

What version of the product are you using? On what operating system?
- v1.8 / Linux 2.6.9-11.ELsmp / jre-1.5.0

Please provide any additional information below.

Thanks.
praveen ([email protected])

Original issue reported on code.google.com by [email protected] on 6 Aug 2010 at 7:24

Not an issue but something that would be nice to add

Hello,

Nice job on the crawler! I think it would be awesome to add some
documentation on how to run the crawler on a Linux server, and how
it can interact with a MySQL database.

Let's say I have a list of URLs to crawl in my MySQL database. I would like
the crawler to be able to get them from there and fill in the title and
description metadata in the database.

At this point, I'm not even sure how to run it on my Linux server!
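
Not documentation, but roughly what the MySQL side could look like with plain 
JDBC; the table and column names here are made up for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class UrlStore {

    private final Connection conn;

    public UrlStore(String jdbcUrl, String user, String password) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");   // register the MySQL driver
        conn = DriverManager.getConnection(jdbcUrl, user, password);
    }

    // Read the seed URLs the crawler should start from.
    public List<String> loadSeeds() throws Exception {
        List<String> seeds = new ArrayList<String>();
        PreparedStatement ps = conn.prepareStatement("SELECT url FROM pages_to_crawl");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            seeds.add(rs.getString(1));
        }
        rs.close();
        ps.close();
        return seeds;
    }

    // Called from the crawler's visit() with whatever was extracted from the page.
    public void saveMetadata(String url, String title, String description) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
                "UPDATE pages_to_crawl SET title = ?, description = ? WHERE url = ?");
        ps.setString(1, title);
        ps.setString(2, description);
        ps.setString(3, url);
        ps.executeUpdate();
        ps.close();
    }
}

The seeds returned by loadSeeds() would be fed to controller.addSeed() before 
starting the crawl, and visit() would call saveMetadata() for each page.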

Thank you! and keep up the good work :)

Original issue reported on code.google.com by [email protected] on 10 Apr 2010 at 6:50

Erroneous link URL extraction if the HTML contains <base href="...">

What steps will reproduce the problem?
1. Crawl a site http://a.b/c/d/e.html where the HTML contains <base href="http://a.b/c/">
2. Any relative links in the page will be wrongly extracted, e.g. "../x.html" 
will be extracted as "http://a.b/c/x.html" instead of "http://a.b/x.html"

What is the expected output? What do you see instead?
Any relative links in the page will be wrongly extracted, e.g. "../x.html" will 
be extracted as "http://a.b/c/x.html" instead of "http://a.b/x.html"

What version of the product are you using? On what operating system?
version 2.2 and latest build from SVN. Windows 7.


Please provide any additional information below.
The attached patch to /src/edu/uci/ics/crawler4j/crawler/HTMLParser.java may help.
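
For reference, this is the idea behind the patch: resolve relative links against 
the <base href> when one is present, which java.net.URI already does correctly 
(a sketch, not the patch itself):

import java.net.URI;

public class BaseHrefExample {
    public static void main(String[] args) {
        URI pageUrl = URI.create("http://a.b/c/d/e.html");
        URI baseHref = URI.create("http://a.b/c/");   // from <base href="...">

        // Wrong: resolved against the page URL -> http://a.b/c/x.html
        System.out.println(pageUrl.resolve("../x.html"));
        // Right: resolved against the base href -> http://a.b/x.html
        System.out.println(baseHref.resolve("../x.html"));
    }
}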

Original issue reported on code.google.com by [email protected] on 31 Dec 2010 at 2:42

Attachments:

Processing of robots.txt causes java.lang.StringIndexOutOfBoundsException: String index out of range: -3

What steps will reproduce the problem?
1. Run the crawler for a domain that has a robots.txt file with an 'Allow:' 
directive (for example http://www.explido-webmarketing.de/)


What is the expected output? What do you see instead?
Exception appears:

java.lang.StringIndexOutOfBoundsException: String index out of range: -3
    at java.lang.String.substring(String.java:1937)
    at java.lang.String.substring(String.java:1904)
    at edu.uci.ics.crawler4j.robotstxt.RobotstxtParser.parse(RobotstxtParser.java:86)
    at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.fetchDirectives(RobotstxtServer.java:77)
    at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.allows(RobotstxtServer.java:57)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.preProcessPage(WebCrawler.java:187)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:105)
...


What version of the product are you using? On what operating system?
version - 2.6
operation system - windows 7

Please provide any additional information below.
It seems that the value of the constant 
edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.PATTERNS_ALLOW_LENGTH is 
incorrect.
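
A sketch of how the prefix lengths could be derived so they cannot drift out of 
sync with the literals (the constant name follows the one mentioned above; this 
is not the actual crawler4j source):

// Deriving the lengths from the literals avoids off-by-N substring errors
// such as the "String index out of range: -3" above.
private static final String PATTERN_ALLOW = "allow:";
private static final String PATTERN_DISALLOW = "disallow:";
private static final int PATTERNS_ALLOW_LENGTH = PATTERN_ALLOW.length();        // 6
private static final int PATTERNS_DISALLOW_LENGTH = PATTERN_DISALLOW.length();  // 9

// e.g. value = line.substring(PATTERNS_ALLOW_LENGTH).trim();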


Original issue reported on code.google.com by [email protected] on 16 Mar 2011 at 5:03

crawl to infinity

What steps will reproduce the problem?
1. Create a PHP-Script with the following content:

<?php
if(!isset($_SERVER["QUERY_STRING"]))
    $_SERVER["QUERY_STRING"] = "";
$link = $_SERVER["PHP_SELF"]."?".$_SERVER["QUERY_STRING"]."&test=test";
$link = str_replace("?&", "?", $link);
?>
<a href="<?php echo $link;?>">test</a>

2. Run your Crawler against this page

What is the expected output? What do you see instead?
Expected: an elegant way to prevent this behaviour
Instead: only a URL object in shouldVisit to perform checks on

What version of the product are you using? On what operating system?
1.8.1, Windows XP

Please provide any additional information below.
Perhaps there is an elegant way, i am not aware of...
AND: I do not find a way to change this issue to an enhancement.
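
Lacking a built-in guard, the URL object in shouldVisit is enough for a crude 
defence: cap the URL length or the number of query parameters so the 
self-appending &test=test chain cannot grow forever (a sketch; the limits and 
seed prefix are examples):

@Override
public boolean shouldVisit(WebURL url) {
    String href = url.getURL();
    // Refuse URLs that have grown suspiciously long or accumulated too many
    // query parameters (the self-referencing "&test=test" case above).
    if (href.length() > 512) {
        return false;
    }
    int queryStart = href.indexOf('?');
    if (queryStart >= 0 && href.substring(queryStart + 1).split("&").length > 10) {
        return false;
    }
    return href.startsWith("http://www.example.com/");
}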

Original issue reported on code.google.com by [email protected] on 15 Apr 2010 at 12:36

NoHttpResponseException

What steps will reproduce the problem?
1.why this happen 
2.
3.

INFO [Crawler 1] (DefaultRequestDirector.java:491) - I/O exception 
(org.apache.http.NoHttpResponseException) caught when processing request: The 
target server failed to respond
INFO [Crawler 1] (DefaultRequestDirector.java:498) - Retrying request



Original issue reported on code.google.com by [email protected] on 21 Jun 2010 at 10:47

Make cookie policy configurable

Some servers screw up cookies. It would be great to be able to set the cookie 
policy, for example to IGNORE_COOKIES:

HttpGet httpget = new HttpGet("http://www.broken-server.com/");
// Override the default policy for this request
httpget.getParams().setParameter(
        ClientPNames.COOKIE_POLICY, CookiePolicy.IGNORE_COOKIES);


See: 
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/statemgmt.html#d4e808
and 
http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/constant-values.html#org.apache.http.client.params.ClientPNames.COOKIE_POLICY



Original issue reported on code.google.com by [email protected] on 13 May 2011 at 6:26

How to crawl a site incrementally?


Every time I start a crawler it deletes the "frontier" directory and begins 
from the first page again.
How can I resume downloading a site from the place where it last stopped? 
Please help.
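
According to another report on this page, the SVN sources have a 
CrawlController(String, boolean) constructor; the second argument apparently 
marks the crawl as resumable, so if you build from trunk the idea is roughly 
this (hedged, since the released jars may not contain it):

// true = resumable: keep the storage folder (including the frontier) instead of
// deleting it, so a new run continues from where the previous one stopped.
CrawlController controller = new CrawlController("/data/crawl/root", true);
controller.addSeed("http://www.example.com/");
controller.start(MyCrawler.class, numberOfCrawlers);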

Original issue reported on code.google.com by [email protected] on 24 Mar 2011 at 6:43

Silent stop of the crawler

Hi,
I found that, in a certain environment, the crawler does not crawl any page and 
goes down without saying anything in its own logs or the Tomcat logs. I can't 
find a reason why it doesn't crawl any page - no error/warning whatsoever.

It works just fine in my local env (win XP, sun jdk-1.5), but doesn't work in a 
hosted env (Linux 2.6.9-11.ELsmp / jre-1.5.0 / OS Architecture: i386).

I don't have access to the erroneous environment, so any help/investigation is 
appreciated.

What steps will reproduce the problem?

What is the expected output? What do you see instead?
- No crawl happens and no error/warning appears.

What version of the product are you using? On what operating system?
- v1.8 / Linux 2.6.9-11.ELsmp / jre-1.5.0

Please provide any additional information below.

Thanks.
praveen ([email protected])

Original issue reported on code.google.com by [email protected] on 6 Aug 2010 at 7:30

IdleConnectionMonitorThread never ends

What steps will reproduce the problem?
1. Just run the example (I run it in Eclipse 3.4 with JDK 1.5.0.17)
2.
3.

What is the expected output? What do you see instead?
I expect the IdleConnectionMonitorThread to end when execution passes this 
line in Controller.java:

 controller.start(MyCrawler.class, 10);

but the IdleConnectionMonitorThread thread is still running.

What version of the product are you using? On what operating system?
I'm using the latest code (the same as ) but I am working on Java 1.5 but I 
don't think that is an issue.


Please provide any additional information below.
I have a workaround: just comment out this line in PageFetcher.java:

line 117:   new Thread(new IdleConnectionMonitorThread(connectionManager)).start();

I'm not sure whether it is a good solution, but it works fine.
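
Another option, instead of removing the monitor entirely, is to mark it as a 
daemon thread so it cannot keep the JVM alive once the crawl is done (a sketch 
of the same line in PageFetcher.java):

// A daemon thread does not prevent JVM shutdown, so the monitor keeps running
// while crawling but dies with the application when start() returns.
Thread monitor = new Thread(new IdleConnectionMonitorThread(connectionManager));
monitor.setDaemon(true);
monitor.start();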

Original issue reported on code.google.com by [email protected] on 9 Nov 2010 at 4:11

Provide depth of crawling.

Hi,
   I want to know where we can set the depth of crawling in crawler4j.

What is the expected output? What do you see instead?
The crawler crawls a lot of data from a website, so I am unable to control the 
depth of crawling for the given site. I want to set a certain depth for 
crawling the site from the program.


What version of the product are you using? On what operating system?
crawler4j-2.2 and windows XP.


Please provide any additional information below.
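For what it's worth, another report on this page mentions a crawler.max_depth 
setting read from crawler4j.properties; something along these lines is the 
usual way to cap the depth (hedged: the exact key and accepted values depend on 
the release you use):

# crawler4j.properties (on the classpath)
crawler.max_depth=3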


Original issue reported on code.google.com by [email protected] on 28 Dec 2010 at 8:50

are there any plans to move to maven?

I was just wondering: have you thought about porting crawler4j to Maven?
I think that using Maven would make it easier for other projects to include 
crawler4j as a dependency.

Original issue reported on code.google.com by [email protected] on 4 May 2011 at 8:10

crawler will not follow relative URLs in redirects

What steps will reproduce the problem?
1.  Take the simple crawler example; remove all calls to controller.addSeed() 
and replace with this one
controller.addSeed("http://dairymix.com/");
2. This URL redirects. Below are the relevant headers

Server         Microsoft-IIS/6.0
X-Powered-By   ASP.NET
Location       website_import_001.htm

Of importance: note that the Location value is a relative URL.

What is the expected output? What do you see instead?
I see an exception.

java.lang.NullPointerException

    at edu.uci.ics.crawler4j.frontier.DocIDServer.getDocID(DocIDServer.java:70)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:143)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:108)
    at java.lang.Thread.run(Unknown Source)

Although technically relative URLs are not valid in the Location header, the 
Apache HttpClient library handles them correctly; it would be reasonable to 
assume crawler4j would handle this also.

What version of the product are you using? On what operating system?
SVN trunk rev.21

Please provide any additional information below.

Please add an extra config option to crawler4j.properties so that we can 
override the default behavior and let the HTTP client library handle the 
redirects, or update the WebCrawler class to handle relative URLs in this case.
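
In the meantime, a relative Location value can be resolved against the request 
URL the same way a browser would, before it is handed to getDocID (a sketch):

import java.net.URI;

// movedToUrl is the raw Location header value; requestUrl is the URL that was fetched.
static String absoluteRedirectTarget(String requestUrl, String movedToUrl) {
    // "website_import_001.htm" resolved against "http://dairymix.com/"
    // becomes "http://dairymix.com/website_import_001.htm".
    return URI.create(requestUrl).resolve(movedToUrl).toString();
}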

Original issue reported on code.google.com by [email protected] on 23 May 2011 at 10:23

efficiency suggestion

PageFetcher.fetch(Page page) is currently used by all crawler threads as a 
static utility, and it has become a bottleneck. Instead, why don't you put a 
PageFetcher instance as an instance variable in the WebCrawler class? That way 
the crawlers won't have to wait for each other to have their pages processed.

the changes I made:

WebCrawler gets a new PageFetcher instance variable
public class WebCrawler implements Runnable {

    private PageFetcher pagefetcher;
...
    private int preProcessPage(WebURL curURL) {
        Page page = new Page(curURL);
//      int statusCode = PageFetcher.fetch(page);
        int statusCode = pagefetcher.fetch(page);

and PageFetcher

//  public static int fetch(Page page) {
    public int fetch(Page page) {
...
//          synchronized (mutex) {
//          }

It worked very well in my case!
I was able to go 5x faster: I was downloading about 5 pages per second and now 
I am at my bandwidth limit. (No, I was not being polite: 10 threads, 
fetcher.default_politeness_delay=0; my application requires a "snapshot".)

Please contact me with your conclusions, even if it is to tell me that my 
changes break the code. I really need to know if they do!

Original issue reported on code.google.com by [email protected] on 14 Sep 2010 at 7:21

How to get the original links in HTML

Add the method below to HTMLParser.java to do this. I am posting this code in 
case anyone has the same need.

public Set<String> fetchOriginLinks(String htmlContent) {
        HashSet<String> originUrls = new HashSet<String>();
        char[] chars = htmlContent.toCharArray();
        linkExtractor.urls.clear();
        bulletParser.setCallback(linkExtractor);
        bulletParser.parse(chars);
        Iterator<String> it = linkExtractor.urls.iterator();

        int urlCount = 0;
        while (it.hasNext()) {
            String href = it.next();
            href = href.trim();
            if (href.length() == 0) {
                continue;
            }
            String hrefWithoutProtocol = href.toLowerCase();
            if (href.startsWith("http://")) {
                hrefWithoutProtocol = href.substring(7);
            }
            if (hrefWithoutProtocol.indexOf("javascript:") < 0
                    && hrefWithoutProtocol.indexOf("@") < 0) {
                originUrls.add(href);
                urlCount++;
                if (urlCount > MAX_OUT_LINKS) {
                    break;
                }
            }
        }
        linkExtractor.urls.clear();
        return originUrls;
    }



Original issue reported on code.google.com by [email protected] on 9 Apr 2011 at 5:45

Retrieving time information about a request

This is not a bug; it is a question and/or feature request (or maybe there is 
already a way to retrieve this additional information).
In the crawler's visit method we can access the Page object with some 
information such as the URL, text, etc. Is there a way to retrieve the time 
taken to complete the request for each page?
Thanks for your attention
Rob

Original issue reported on code.google.com by roberto.butti on 9 Feb 2010 at 1:20
