mondego / crawler4py
A web crawler in Python
Hi,
I am trying to read from the shelve but keep getting this error:
import shelve
shelve.open("Persistent.shelve.db")
Traceback (most recent call last):
  File "SampleCrawler.py", line 13, in <module>
    shelve.open("Persistent.shelve.db")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shelve.py", line 239, in open
    return DbfilenameShelf(filename, flag, protocol, writeback)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shelve.py", line 223, in __init__
    Shelf.__init__(self, anydbm.open(filename, flag), protocol, writeback)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/anydbm.py", line 82, in open
    raise error, "db type could not be determined"
anydbm.error: db type could not be determined
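One way to narrow this error down is to ask the dbm layer what it thinks the file is. The sketch below is written for Python 3, where `dbm.whichdb` takes the place of the `whichdb` module that Python 2's `anydbm` relies on. `describe_shelf` is a hypothetical helper name, not part of crawler4py. Some dbm backends append `.db` to the name themselves, so trying the path without that suffix may be worth a test, though that is only an assumption about how this particular file was created.

```python
import dbm


def describe_shelf(path):
    """Report which dbm backend, if any, recognizes the file at `path`."""
    kind = dbm.whichdb(path)
    if kind is None:
        # The file does not exist or could not be read.
        return "missing or unreadable"
    if kind == "":
        # The file exists, but no known dbm format matches it.
        return "unrecognized format"
    # Otherwise whichdb returns a module name such as "dbm.gnu" or "dbm.dumb".
    return kind


if __name__ == "__main__":
    # Compare the name with and without the ".db" suffix.
    for candidate in ("Persistent.shelve.db", "Persistent.shelve"):
        print(candidate, "->", describe_shelf(candidate))
```

An "unrecognized format" result would suggest the file was written by a dbm backend that this Python installation does not have, or that it was truncated while the crawler was still running.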
Hello lordnahor. Thanks for sharing the crawler on GitHub.
Recently I tried using http://www.ics.uci.edu/ as the seed for a crawl.
The first time, I crawled for 10 hours and got a Persistent.shelve of about 150MB.
The second time, I stopped at 300MB.
But I wonder whether there is a designed "stop" condition, such as running out of the frontier, or whether the crawl only stops by accident.
One more question: I want to double-check whether I can read "text" from the shelve file.
Because when I execute
d=shelve.open("Persistent.shelve.db")
print "Persistent.shelve.db",d
What I get is just
'http://www.ics.uci.edu/grad/courses/listing.php': (True, 3), 'http://www.ics.uci.edu/prospective/ko/degrees/business-information-management': (False, 4), 'http://hombao.ics.uci.edu?s=opportunities': (False, 4), 'http://asterix.ics.uci.edu/talks.html': (False, 4),
Thanks for your answer.
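Judging only from the output quoted above, each entry maps a URL to a two-element tuple rather than to page text. A minimal Python 3 sketch for inspecting such a file follows. It assumes every value is a `(flag, number)` pair like the ones shown, and it does not assume what those two fields mean. `summarize_shelf` is a hypothetical helper name.

```python
import shelve
from collections import Counter


def summarize_shelf(path):
    """Count entries by the first element of each stored tuple."""
    # flag="r" opens the shelf read-only so a running crawl is not modified.
    with shelve.open(path, flag="r") as db:
        return Counter(flag for flag, _number in db.values())


def print_entries(path, limit=10):
    """Print the first few URL -> value pairs stored in the shelf."""
    with shelve.open(path, flag="r") as db:
        for index, (url, value) in enumerate(db.items()):
            if index >= limit:
                break
            print(url, value)
```

If the counts show only tuples of this shape, the page text is most likely not stored in this file at all.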
The crawler sleeps in each thread independently, so parallel threads together send requests more often than the politeness setting allows.
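One possible fix is to share a single delay across all threads instead of sleeping per thread. The sketch below is not crawler4py's code. `PolitenessGate` and `delay_seconds` are hypothetical names, and it assumes the politeness setting is a minimum number of seconds between any two requests.

```python
import threading
import time


class PolitenessGate:
    """Hands out request slots spaced at least `delay_seconds` apart."""

    def __init__(self, delay_seconds):
        self._delay = delay_seconds
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it
        # so other threads can reserve their own later slots meanwhile.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._delay
        pause = slot - now
        if pause > 0:
            time.sleep(pause)
```

Each worker thread would call `gate.wait()` on one shared `PolitenessGate` instance right before fetching a URL, so the spacing holds no matter how many threads run.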