rivermont / spidy
The simple, easy to use command line web crawler.
License: GNU General Public License v3.0
It would be great to have spidy on the Package Index (the new one); installing it would then be a single command. I have tried to get it going; my efforts can be found on the pypi-dev branch.
There are some problems with imports, file saving, tests, etc.
While I would like to be the owner/maintainer of the package, some help getting it started would be greatly appreciated.
Reppy doesn't work past Python 3.8 - seomoz/reppy#122, seomoz/reppy#132 - which means our robots.txt parser isn't working (#81).
Python 3.8 also reaches end-of-life next year so this needs to happen anyway.
Getting the below error
File "crawler.py", line 217
package='reppy')
^
IndentationError: unexpected indent
when running python crawler.py
Incidentally, that line was changed recently in 3a62f5e
I have been asked multiple times what makes Spidy different from similar projects like Scrapy and BeautifulSoup. There is a section in the README, but it doesn't directly address other projects.
Can you please use GitHub Releases so I can Unwatch issues and only follow via Watch -> Custom -> Releases?
Currently, a request is sent for a site's robots.txt every time a link is crawled. It would be much faster if the results of a robots.txt query were saved in some database. Only one request should need to be sent.
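A rough sketch of that caching idea, using urllib.robotparser from the standard library and a plain dict keyed by domain; the can_crawl helper and the cache structure are illustrative, not existing spidy code, and a real implementation might persist the cache to disk.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Parsed robots.txt files keyed by domain, so each site is queried only once.
robots_cache = {}

def can_crawl(url, user_agent='spidy'):
    domain = urlparse(url).netloc
    if domain not in robots_cache:
        parser = RobotFileParser('http://{0}/robots.txt'.format(domain))
        parser.read()                      # the single request for this domain
        robots_cache[domain] = parser
    return robots_cache[domain].can_fetch(user_agent, url)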
Possibly something reminiscent of the Python logo, but not obvious. Other than that, anything would be great.
Crawling would go much faster connecting to multiple pages at once.
Possible Problems:
These should be solvable using mutexes (see the sketch below).
Arguments for: robots.txt
NOTES: Don't try to use sys.argv.
All threads should stop while the crawler prints info and saves files. Once one thread reaches SAVE_COUNT links crawled, it saves while the other threads continue. This results in [CRAWL] logs in between [INFO] logs. It seems like this is inefficient and could result in some saving errors.
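A minimal sketch of the mutex idea, assuming a shared SAVE_COUNT threshold like the one described above; record_crawl and save_files are hypothetical names, not spidy's actual functions.

import threading

save_lock = threading.Lock()
links_crawled = 0
SAVE_COUNT = 100

def record_crawl(save_files):
    global links_crawled
    with save_lock:                     # only one thread updates counters or saves at a time
        links_crawled += 1
        if links_crawled % SAVE_COUNT == 0:
            save_files()                # e.g. write the TODO/DONE/words lists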
After multiple tries I have yet to get PyPI to format the README correctly. Current state can be viewed here.
At the moment, I followed this SO answer and convert README.md to reStructuredText when running setup.py. However, it looks like you cannot have relative links on a PyPI page, so I changed them all to https links to the GitHub files. It's still displaying the RST as text and not rendering it, but I don't know where to go from here.
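For reference, this is roughly what that conversion approach looks like in setup.py (a sketch, assuming pypandoc and pandoc are installed; the fallback to raw Markdown is just a safety net, and the metadata shown is taken from elsewhere in this page):

from setuptools import setup

try:
    import pypandoc
    # Convert the Markdown README to reStructuredText for PyPI.
    long_description = pypandoc.convert_file('README.md', 'rst')
except (ImportError, OSError):
    with open('README.md', encoding='utf-8') as f:
        long_description = f.read()

setup(
    name='spidy-web-crawler',
    version='1.6.5',
    long_description=long_description,
    # ... other metadata omitted ...
)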
$ docker run --rm -it -v $PWD:/data spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Starting spidy Web Crawler version 1.6.5
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Report any problems to GitHub at https://github.com/rivermont/spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating classes...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating functions...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating variables...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Should spidy load settings from an available config file? (y/n):
n
[01:01:40] [spidy] [WORKER #0] [INIT] [INFO]: Please enter the following arguments. Leave blank to use the default values.
[01:01:40] [spidy] [WORKER #0] [INIT] [INPUT]: How many parallel threads should be used for crawler? (Default: 1):
[01:01:47] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy load from existing save files? (y/n) (Default: Yes):
[01:01:54] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy raise NEW errors and stop crawling? (y/n) (Default: No):
[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy save the pages it scrapes to the saved folder? (y/n) (Default: Yes):
[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy zip saved documents when autosaving? (y/n) (Default: No):
[01:01:57] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy download documents larger than 500 MB? (y/n) (Default: No):
[01:01:58] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy scrape words and save them? (y/n) (Default: Yes):
[01:01:59] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy restrict crawling to a specific domain only? (y/n) (Default: No):
y
[01:02:02] [spidy] [WORKER #0] [INIT] [INPUT]: What domain should crawling be limited to? Can be subdomains, http/https, etc.
http://www.frankshospitalworkshop.com/
[01:02:07] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy respect sites' robots.txt? (y/n) (Default: Yes):
y
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: What HTTP browser headers should spidy imitate?
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: Choices: spidy (default), Chrome, Firefox, IE, Edge, Custom:
[01:02:14] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the TODO save file (Default: crawler_todo.txt):
/data/crawler_todo.txt
[01:02:24] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the DONE save file (Default: crawler_done.txt):
/data/crawler_done.txt
[01:02:31] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the words save file (Default: crawler_words.txt):
/data/crawler_words.txt
[01:02:38] [spidy] [WORKER #0] [INIT] [INPUT]: After how many queried links should the crawler autosave? (Default: 100):
[01:02:39] [spidy] [WORKER #0] [INIT] [INPUT]: After how many new errors should spidy stop? (Default: 5):
[01:02:40] [spidy] [WORKER #0] [INIT] [INPUT]: After how many known errors should spidy stop? (Default: 10):
[01:02:41] [spidy] [WORKER #0] [INIT] [INPUT]: After how many HTTP errors should spidy stop? (Default: 20):
[01:02:42] [spidy] [WORKER #0] [INIT] [INPUT]: After encountering how many new MIME types should spidy stop? (Default: 20):
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Loading save files...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Successfully started spidy Web Crawler version 1.6.5...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Using headers: {'User-Agent': 'spidy Web Crawler (Mozilla/5.0; bot; +https://github.com/rivermont/spidy/)', 'Accept-Language': 'en_US, en-US, en', 'Accept-Encoding': 'gzip', 'Connection': 'keep-alive'}
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Spawning 1 worker threads...
[01:02:43] [spidy] [WORKER #1] [INIT] [INFO]: Starting crawl...
[01:02:43] [reppy] [WORKER #0] [ROBOTS] [INFO]: Reading robots.txt file at: http://www.frankshospitalworkshop.com/robots.txt
[01:02:45] [spidy] [WORKER #1] [CRAWL] [ERROR]: An error was raised trying to process http://www.frankshospitalworkshop.com/equipment.html
[01:02:45] [spidy] [WORKER #1] [ERROR] [INFO]: An XMLSyntaxError occurred. A web dev screwed up somewhere.
[01:02:45] [spidy] [WORKER #1] [LOG] [INFO]: Saved error message and timestamp to error log file
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: Stopping all threads...
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: I think you've managed to download the entire internet. I guess you'll want to save your files...
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved TODO list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved DONE list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved 0 words to /data/crawler_words.txt
Calling any passers-by to take a moment to submit feature requests or bugs, no matter how small!
Please see the README for a general overview of this project, docs.md for some outdated documentation, and CONTRIBUTING.md for some more words.
So far spidy has been developed on Windows for Windows, but obviously that won't work.
After doing some testing, it seems that on Linux (Ubuntu 16.04, at least) the crawler interprets folders such as config/ with the slash as part of the folder name, whereas on Windows the slash is needed to indicate a folder.
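One way around this is to build every path with os.path.join so the separator is chosen per platform; a small sketch (WORK_DIR and the file names here are illustrative, not spidy's actual variables):

import os

WORK_DIR = os.path.dirname(os.path.abspath(__file__))
CONFIG_DIR = os.path.join(WORK_DIR, 'config')              # no trailing slash needed
todo_path = os.path.join(WORK_DIR, 'crawler_todo.txt')     # correct separator on Windows and Linux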
I've been looking for a decent command-line web crawler for some time and came across this project. It seems very promising.
I have been working on getting this project working in a docker container.
Would you be interested in my contributing the dockerfile back to this project for anyone else who might be interested in the same?
Hi, I tried to use spidy because it looked promising.
Is it dead?
First:
sudo pip install -r requirements.txt
doesn't work; reppy is not installable (Python 3.9).
Second:
Docker is a pain...
Please look into ConfigArgParse if you need config files, but make sure that plain command-line arguments can be used as well (see the sketch after this block).
With Docker, there is no error log... I ended up with
docker run --rm -it -v $PWD:/data -w /data --entrypoint /src/app/spidy/crawler.py spidy
so that the error log is accessible (why is there no config option?!).
Why is a suffix on the config file enforced? What is that, Windows?
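A minimal sketch of the ConfigArgParse suggestion above; the option names (--config, --todo-file, --respect-robots) are hypothetical, not existing spidy flags:

import configargparse

p = configargparse.ArgParser(default_config_files=['config/default.cfg'])
p.add('-c', '--config', is_config_file=True, help='path to a config file')
p.add('--todo-file', default='crawler_todo.txt', help='location of the TODO save file')
p.add('--respect-robots', action='store_true', help='obey sites\' robots.txt')

options = p.parse_args()
print(options.todo_file, options.respect_robots)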
Third:
My config contained either an IP or a hostname (resolved via /etc/hosts). Spidy did not spider either.
For the hostname option it gave:
ERROR: OSError
EXT: HTTPConnectionPool(host='example.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ae176ecc0>: Failed to establish a new connection: [Errno -2] Name or service not known',))
It seems that it doesn't respect /etc/hosts?!
But neither did the IP option work, e.g. '192.168.1.55/wiki/'.
Some cookie handling functionality would be pretty valuable. Setting cookies in the config file should be trivial to implement. An option to send a GET or POST every n scraped URLs / minutes, and apply the received cookies to subsequent requests, would be great too. I might do a PR if I have the time in the near future.
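A rough sketch of that cookie idea using requests.Session; the login URL and cookie values are placeholders, and wiring this into spidy's config file would still be needed:

import requests

session = requests.Session()
# Cookies loaded from a config file could simply be set on the session:
session.cookies.update({'sessionid': 'abc123'})

# Every n scraped URLs / minutes, send a GET or POST and keep the returned cookies:
session.post('https://example.com/login', data={'user': 'me', 'pass': 'secret'})

# Subsequent requests automatically carry any Set-Cookie values received above.
page = session.get('https://example.com/members-only')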
If you run the crawler and take a look at crawler_bad (or whatever the bad links file is for your configuration), it will have some links in byte form - another problem - but most lines will be single characters. Must fix.
CONTRIBUTING.md has some guidelines, but essentially there is simply a lot of stuff that needs to be filled out in the docs.
Also, if you would like to use another documentation format, feel free. Listing everything is something I came up with in early development, but it's probably not scalable.
I will add a .travis.yml shortly; it may or may not work.
Docker should simplify things, not make them harder.
Docker is a struggle; you have to build the image several times before it works:
It ignores configs that are in the /data directory, and uses only the defaults that were in the repo.
It creates results as root, etc. etc.
The best workaround is to use -v $PWD:/src/app/spidy/config/ but it's still ugly.
Error/Bug in code:
"an project management platform hat integrates with GitHub" -> "a project management platform that integrates with GitHub"
"If you make changed to crawler.py..." -> "If you make changes to crawler.py..."
Simple spelling (typo) and grammar correction, also testing the contribution process.
The initial HEAD request sent to get the document's size is not using the HTTP headers that are set for the GET request. Should be a simple fix.
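The fix would presumably look something like this (a sketch; the headers dict is the one shown in the startup log above, and the URL is a placeholder):

import requests

headers = {'User-Agent': 'spidy Web Crawler (Mozilla/5.0; bot; +https://github.com/rivermont/spidy/)'}
url = 'http://example.com/some-large-file'

# Pass the same headers to the size-checking HEAD request as to the GET:
head = requests.head(url, headers=headers, allow_redirects=True)
size = int(head.headers.get('Content-Length', 0))
if size < 500 * 1024 * 1024:    # the 500 MB limit from the config
    response = requests.get(url, headers=headers)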
It would be a good idea to have templates for these. I have created TEMPLATES.md as a start but would love some more input.
I looked here to get started.
crawl like hell
dies of an unknown error
echo "https://www.golem.de/"> ./crawler_todo.txt
spidy
Using spidy
Raise an InputError and then stop the crawler.
No output, just exits straight to console.
From tests.py:
test_make_words_given_string
AttributeError: 'str' object has no attribute 'text'
make_words needs to be passed a requests.models.Response object in order to extract the text properly; however, I'm not sure there is a way to simulate that. Something like StringIO for files...?
test_mime_lookup_given_unknown_type
crawler.HeaderError: Unknown MIME type: this_mime_doesn't_exist
This one probably just needs an assertRaises statement.
It would be great to have a check in tests.py for the multithreading and queue.
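One way the two tests could be written, sketched with unittest.mock rather than a real HTTP response; the expected return values and the exact signatures of make_words and mime_lookup are assumptions:

import unittest
from unittest.mock import Mock

import crawler  # the spidy crawler module under test


class CrawlerTests(unittest.TestCase):
    def test_make_words_given_response(self):
        # Fake requests.models.Response: only the .text attribute is needed.
        fake_response = Mock()
        fake_response.text = 'some words scraped from a fake page'
        words = crawler.make_words(fake_response)
        self.assertIn('words', words)

    def test_mime_lookup_given_unknown_type(self):
        # The HeaderError should be expected, not allowed to fail the test.
        with self.assertRaises(crawler.HeaderError):
            crawler.mime_lookup("this_mime_doesn't_exist")


if __name__ == '__main__':
    unittest.main()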
Most likely using something like py2exe.
The crawler concatenates the child's URI onto the parent:
https://mysite/folder/page
=> found: /js/main.js
https://mysite/folder/page/js/main.js
Same thing when a link doesn't have a protocol declared:
https://mysite/folder/page
=> found: //subdomain.mysite/images/myimage.png
https://mysite/folder/page//subdomain.mysite/images/myimage.png
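For what it's worth, urllib.parse.urljoin handles both of the cases above; a short sketch:

from urllib.parse import urljoin

parent = 'https://mysite/folder/page'

# Root-relative path:
print(urljoin(parent, '/js/main.js'))
# https://mysite/js/main.js

# Protocol-relative link:
print(urljoin(parent, '//subdomain.mysite/images/myimage.png'))
# https://subdomain.mysite/images/myimage.png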
Install
apt-get install python3 python3-lxml python3-requests
apt-get install python3-pip python-pip
pip3 install spidy-web-crawler
Starting spidy Web Crawler version 1.6.5
Am I the only one with this problem? Thanks for your help.
I have no experience with writing Python tests, but it seems to be necessary when a program gets big.
Since the small parts of the crawler are already broken up into separate functions, one approach might be to test that each function runs without error and returns the expected type.
We would love to confirm that Spidy will run on all systems, and fix any bugs that may be hidden!
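A tiny sketch of that "runs without error and returns the expected type" approach, written for pytest; check_link and make_file_path are assumed helper names, not necessarily spidy's actual functions:

import crawler  # the spidy crawler module under test


def test_check_link_returns_bool():
    # Hypothetical helper: should run without error and return the expected type.
    assert isinstance(crawler.check_link('http://example.com/'), bool)


def test_make_file_path_returns_str():
    # Hypothetical helper: likewise, only the return type is asserted here.
    assert isinstance(crawler.make_file_path('http://example.com/page', '.html'), str)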
Use the rivermont-infinite or heavy config (to test all features) or a custom configuration. No errors.
Seemingly randomly, crawling a URL will fail with a string index out of range error. There doesn't seem to be anything wrong with the URLs:
http://www.denverpost.com/breakingnews/ci_21119904
https://www.publicintegrity.org/2014/07/15/15037/decades-making-decline-irs-nonprofit-regulation
https://cdn.knightlab.com/libs/timeline3/latest/js/timeline-min.js
https://github.com/rivermont/spidy/
https://twitter.com/adamwhitcroft
Raising the error gave the traceback:
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "crawler.py", line 260, in crawl_worker
if link[0] == '/':
IndexError: string index out of range
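A minimal guard for that traceback would be to skip empty strings before indexing (a sketch; normalize_link is a made-up helper, not the actual crawl_worker code):

def normalize_link(link, base_url):
    if not link:            # empty string: nothing to index, so skip it
        return None
    if link[0] == '/':      # the check from crawler.py line 260
        return base_url + link
    return link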
There should be an option (disable-able) to ignore links that are forbidden by a site's robots.txt. Another library or huge regex might need to be used to parse out the domain the page is on.
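Pulling the domain out shouldn't need another library or a huge regex; urllib.parse already does it (a short sketch, using a URL from the log above):

from urllib.parse import urlparse

url = 'http://www.frankshospitalworkshop.com/equipment.html'
parsed = urlparse(url)
robots_url = '{0}://{1}/robots.txt'.format(parsed.scheme, parsed.netloc)
# robots_url == 'http://www.frankshospitalworkshop.com/robots.txt'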