rivermont / spidy

The simple, easy to use command line web crawler.

License: GNU General Public License v3.0

Python 97.91% Makefile 1.28% Dockerfile 0.81%

web-crawler web-spider python python3 crawling crawler

spidy's Issues

PyPI Package

It would be great to have spidy on the Package Index (the new one); installing would then take only one command. I have tried to get it going; my efforts can be found on the pypi-dev branch.

There are some problems with imports, file saving, tests, etc.

While I would like to be the owner/maintainer of the package, some help getting it started would be greatly appreciated.

Update for Python 3.9+, deprecate Reppy dependency

Reppy doesn't work past Python 3.8 - seomoz/reppy#122, seomoz/reppy#132 - which means our robots.txt parser isn't working (#81).
Python 3.8 also reaches end-of-life next year so this needs to happen anyway.

Branch without any robots parser
Find replacement robots parser
Test whole program for 3.9-3.11 compatibility
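
One candidate for the "find replacement robots parser" step is the standard library's urllib.robotparser, which has no compiled dependency. The sketch below is only an illustration of that option, not a change the project has adopted; the rules in ROBOTS_TXT and the build_parser helper are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example rules (assumption); a real crawler would download these
# from <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt content without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("spidy", "http://example.com/index.html")
blocked = parser.can_fetch("spidy", "http://example.com/private/data.html")
print(allowed, blocked)
```

Because the module ships with CPython, it would not block an upgrade to a newer Python version the way a C extension can.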

Getting indent error when running on fresh install.

Getting the below error

  File "crawler.py", line 217
    package='reppy')
    ^
IndentationError: unexpected indent

when running python crawler.py

Incidentally, that line was changed recently in 3a62f5e

Write comparison of spidy and similar projects

I have been asked multiple times what makes Spidy different from similar projects like Scrapy and BeautifulSoup. There is a section in the README, but it doesn't directly address other projects.

GitHub Release last in 2017?

Can you please use GitHub Releases so I can unwatch issues and only follow via Watch -> Custom -> Releases?

Save robots.txt results

Currently, a request is sent for a site's robots.txt every time a link is crawled. It would be much faster if results of a robots.txt query were saved in some database. Only one request should need to be sent.
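
A minimal sketch of the idea, assuming an in-memory dictionary is an acceptable stand-in for "some database". RobotsCache and the fetch_text callable are hypothetical names, and the standard library's urllib.robotparser is used here in place of reppy.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Keep one parsed robots.txt per host so each is fetched only once."""

    def __init__(self, fetch_text):
        # fetch_text is a hypothetical callable: robots.txt URL -> file contents.
        self._fetch_text = fetch_text
        self._parsers = {}

    def allowed(self, user_agent, url):
        parts = urlsplit(url)
        host_key = (parts.scheme, parts.netloc)
        if host_key not in self._parsers:
            robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
            parser = RobotFileParser()
            parser.parse(self._fetch_text(robots_url).splitlines())
            self._parsers[host_key] = parser
        return self._parsers[host_key].can_fetch(user_agent, url)


# Stand-in fetcher that records how often it is called.
requests_made = []


def fake_fetch(robots_url):
    requests_made.append(robots_url)
    return "User-agent: *\nDisallow: /admin/\n"


cache = RobotsCache(fake_fetch)
first = cache.allowed("spidy", "http://example.com/page1")
second = cache.allowed("spidy", "http://example.com/admin/panel")
print(first, second, len(requests_made))
```

Both lookups share one fetch because they resolve to the same scheme and host. With several worker threads, the dictionary access would also need a lock.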

spidy Logo Design

Possibly something reminiscent of the Python logo, but not obvious. Other than that, anything would be great.

Multiple HTTP Threads

Crawling would go much faster connecting to multiple pages at once.

Possible Problems:

Crawling same page twice
Corruption of save files if reading/writing at different places at the same time.

These should be solvable using mutexes.
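
As a rough sketch of the mutex approach, the class below guards the TODO and DONE bookkeeping with a single threading.Lock so that no URL is handed to two threads. CrawlState, claim, add_links and mark_done are hypothetical names, not spidy's own, and the worker skips the actual HTTP request.

```python
import threading


class CrawlState:
    """Shared TODO/DONE bookkeeping guarded by a single lock."""

    def __init__(self, start_urls):
        self._lock = threading.Lock()
        self._seen = set(start_urls)
        self._todo = list(self._seen)
        self.done = []

    def claim(self):
        """Hand out the next URL, or None when the queue is empty."""
        with self._lock:
            return self._todo.pop() if self._todo else None

    def add_links(self, links):
        """Queue only links that no thread has seen before."""
        with self._lock:
            for link in links:
                if link not in self._seen:
                    self._seen.add(link)
                    self._todo.append(link)

    def mark_done(self, url):
        with self._lock:
            self.done.append(url)


def worker(state):
    while True:
        url = state.claim()
        if url is None:
            break
        # A real worker would fetch the page here and call add_links().
        state.mark_done(url)


state = CrawlState([f"http://example.com/{n}" for n in range(50)])
threads = [threading.Thread(target=worker, args=(state,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(state.done), len(set(state.done)))
```

Writing the save files while holding the same lock would address the second problem, since no thread could modify the lists mid-save.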

Command line arguments

Arguments for:

NOTES: Don't try to use sys.argv.
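
Reading the note as advice against hand-parsing sys.argv, the standard argparse module is one way to get typed options with defaults. The flag names below are hypothetical; the defaults mirror the interactive prompts spidy already shows (1 thread, autosave after 100 links, crawler_todo.txt).

```python
import argparse


def build_parser():
    """Build a parser whose options mirror a few of spidy's prompts."""
    parser = argparse.ArgumentParser(
        prog="spidy", description="Command line web crawler")
    # All flag names are assumptions made for this sketch.
    parser.add_argument("--threads", type=int, default=1,
                        help="number of parallel worker threads")
    parser.add_argument("--save-count", type=int, default=100,
                        help="autosave after this many queried links")
    parser.add_argument("--todo-file", default="crawler_todo.txt",
                        help="location of the TODO save file")
    parser.add_argument("--ignore-robots", action="store_true",
                        help="do not check robots.txt before crawling")
    return parser


# An explicit list is passed so the example does not depend on sys.argv.
args = build_parser().parse_args(["--threads", "4", "--ignore-robots"])
print(args.threads, args.save_count, args.todo_file, args.ignore_robots)
```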

Autosave triggered by single thread and not global.

Checklist

Same issue has not been opened before.

Expected Behavior

All threads to stop as crawler prints info and saves files.

Actual Behavior

Once one thread reaches SAVE_COUNT links crawled, it saves while the other threads continue. This results in [CRAWL] logs in between [INFO] logs.

It seems like this is inefficient and could result in some saving errors.
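
One possible fix, sketched under the assumption that a single shared counter replaces the per-thread one: every thread reports each crawled link to the same object, and the save runs while the lock is held, so other threads wait at their next report instead of logging in between. AutosaveCounter and record_crawl are hypothetical names.

```python
import threading

SAVE_COUNT = 100  # same role as spidy's SAVE_COUNT; the value is illustrative


class AutosaveCounter:
    """Count crawled links across all threads and save once per SAVE_COUNT."""

    def __init__(self, save_count, save_func):
        self._lock = threading.Lock()
        self._save_count = save_count
        self._save_func = save_func
        self._since_save = 0

    def record_crawl(self):
        with self._lock:
            self._since_save += 1
            if self._since_save >= self._save_count:
                # The lock is held during the save, so other threads
                # block here until the files are written.
                self._save_func()
                self._since_save = 0


saves = []
counter = AutosaveCounter(SAVE_COUNT, lambda: saves.append("saved"))


def worker():
    for _ in range(250):
        counter.record_crawl()


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(saves))
```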

Steps to Reproduce the Problem

Run crawler
Wait for the autosave cap to be hit.

Specifications

Crawler Version: 1.6.2
Platform: Ubuntu (16.04 LTS)
Python version: 3.5.2
Dependency Versions: All latest.

PyPI Description Formatting

After multiple tries I have yet to get PyPI to format the README correctly. Current state can be viewed here.

At the moment, following this SO answer, I convert README.md to reStructuredText when running setup.py. However, it looks like you cannot have relative links on a PyPI page, so I changed them all to https links to the GitHub files.

It's still displaying the RST as text and not rendering it, but I don't know where to go from here.

Failed crawl for http://www.frankshospitalworkshop.com/

$ docker run --rm -it -v $PWD:/data spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Starting spidy Web Crawler version 1.6.5
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Report any problems to GitHub at https://github.com/rivermont/spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating classes...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating functions...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating variables...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Should spidy load settings from an available config file? (y/n):
n
[01:01:40] [spidy] [WORKER #0] [INIT] [INFO]: Please enter the following arguments. Leave blank to use the default values.
[01:01:40] [spidy] [WORKER #0] [INIT] [INPUT]: How many parallel threads should be used for crawler? (Default: 1):

[01:01:47] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy load from existing save files? (y/n) (Default: Yes):

[01:01:54] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy raise NEW errors and stop crawling? (y/n) (Default: No):

[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy save the pages it scrapes to the saved folder? (y/n) (Default: Yes):

[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy zip saved documents when autosaving? (y/n) (Default: No):

[01:01:57] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy download documents larger than 500 MB? (y/n) (Default: No):

[01:01:58] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy scrape words and save them? (y/n) (Default: Yes):

[01:01:59] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy restrict crawling to a specific domain only? (y/n) (Default: No):
y
[01:02:02] [spidy] [WORKER #0] [INIT] [INPUT]: What domain should crawling be limited to? Can be subdomains, http/https, etc.
http://www.frankshospitalworkshop.com/
[01:02:07] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy respect sites' robots.txt? (y/n) (Default: Yes):
y
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: What HTTP browser headers should spidy imitate?
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: Choices: spidy (default), Chrome, Firefox, IE, Edge, Custom:

[01:02:14] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the TODO save file (Default: crawler_todo.txt):
/data/crawler_todo.txt
[01:02:24] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the DONE save file (Default: crawler_done.txt):
/data/crawler_done.txt
[01:02:31] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the words save file (Default: crawler_words.txt):
/data/crawler_words.txt
[01:02:38] [spidy] [WORKER #0] [INIT] [INPUT]: After how many queried links should the crawler autosave? (Default: 100):

[01:02:39] [spidy] [WORKER #0] [INIT] [INPUT]: After how many new errors should spidy stop? (Default: 5):

[01:02:40] [spidy] [WORKER #0] [INIT] [INPUT]: After how many known errors should spidy stop? (Default: 10):

[01:02:41] [spidy] [WORKER #0] [INIT] [INPUT]: After how many HTTP errors should spidy stop? (Default: 20):

[01:02:42] [spidy] [WORKER #0] [INIT] [INPUT]: After encountering how many new MIME types should spidy stop? (Default: 20):

[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Loading save files...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Successfully started spidy Web Crawler version 1.6.5...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Using headers: {'User-Agent': 'spidy Web Crawler (Mozilla/5.0; bot; +https://github.com/rivermont/spidy/)', 'Accept-Language': 'en_US, en-US, en', 'Accept-Encoding': 'gzip', 'Connection': 'keep-alive'}
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Spawning 1 worker threads...
[01:02:43] [spidy] [WORKER #1] [INIT] [INFO]: Starting crawl...
[01:02:43] [reppy] [WORKER #0] [ROBOTS] [INFO]: Reading robots.txt file at: http://www.frankshospitalworkshop.com/robots.txt
[01:02:45] [spidy] [WORKER #1] [CRAWL] [ERROR]: An error was raised trying to process http://www.frankshospitalworkshop.com/equipment.html
[01:02:45] [spidy] [WORKER #1] [ERROR] [INFO]: An XMLSyntaxError occurred. A web dev screwed up somewhere.
[01:02:45] [spidy] [WORKER #1] [LOG] [INFO]: Saved error message and timestamp to error log file
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: Stopping all threads...
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: I think you've managed to download the entire internet. I guess you'll want to save your files...
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved TODO list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved DONE list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved 0 words to /data/crawler_words.txt

Feature and Bug Reports

Calling any passers-by to take a moment to submit feature requests or bugs, no matter how small!

Please see the README for a general overview of this project, docs.md for some outdated documentation, and CONTRIBUTING.md for some more words.

Linux version of crawler

So far spidy has been developed on Windows for Windows, but obviously that won't work.

After doing some testing, it seems that on Linux (Ubuntu 16.04, at least) the crawler interprets folders such as config/ with the slash as part of the folder name, whereas on Windows the slash is needed to indicate being a folder.
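
A small sketch of a platform-neutral way to build such paths, assuming the folder names are joined by hand somewhere in the crawler. config_path is a hypothetical helper and default.cfg a made-up file name.

```python
import os


def config_path(folder, filename):
    """Join folder and file name using the current platform's separator."""
    # normpath drops a trailing slash, so "config/" and "config" behave alike.
    return os.path.join(os.path.normpath(folder), filename)


with_slash = config_path("config/", "default.cfg")
without_slash = config_path("config", "default.cfg")
print(with_slash, without_slash)
```

Both calls produce the same path, so the trailing slash no longer changes the result on either system.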

Always download from wikipedia

Docker Container

I've been looking for a decent command-line web crawler for some time and came across this project. It seems very promising.

I have been working on getting this project working in a docker container.

Would you be interested in my contributing the dockerfile back to this project for anyone else who might be interested in the same?

unusable

Hi, I tried to use spidy because it looked promising.
Is it dead?

first:
sudo pip install -r requirements.txt
doesn't work; reppy is not installable (Python 3.9)

snd:
Docker is a pita...
Please look into ConfigArgParse if you need config files BUT make sure that arguments can be used as well
with docker, there is no error log...
I ended with
docker run --rm -it -v $PWD:/data -w /data --entrypoint /src/app/spidy/crawler.py spidy
so that the error log is accessible (why is there no config option?!)

why is a suffix on the config file enforced? What is that? Windows?

thrd:
my config contained either an IP or a hostname (resolved via /etc/hosts)
Spidy did not crawl either.
For the hostname option it gave

ERROR: OSError
EXT: HTTPConnectionPool(host='example.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ae176ecc0>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Seems that it doesn't respect /etc/hosts?!
But the IP option didn't work either...
e.g. '192.168.1.55/wiki/'

Cookie handling

Feature Description

Some cookie handling functionality would be pretty valuable. Setting cookies in the config file should be trivial to implement. An option to send a GET or POST every n scraped URLs / minutes, and apply the received cookies to subsequent requests, would be great too. I might do a PR if I have the time in the near future.

Checklist

This feature does not already exist.
This feature has not been requested before.

Crawler saving bad links as jumbled mess - sometimes

If you run the crawler and take a look at crawler_bad (or whatever the bad links file is for your configuration), it will have some links in byte form - another problem - but most lines will be single characters. Must fix.

Documentation Needed

CONTRIBUTING.md has some guidelines, but essentially there is simply a lot of stuff that needs to be filled out in the docs.

Also, if you would like to use another documentation format feel free. Listing everything is something I came up with in early development but it's probably not scalable.

Travis CI Automatic Testing

I will add a .travis.yml shortly, it may or may not work.

Docker is unusable

Expected Behavior

Docker should simplify things not make them harder

Actual Behavior

Docker is a struggle; you have to build the image several times before it works:
It ignores configs that are in the /data directory and uses only the defaults that were in the repo.
It creates results as root, etc.

What I've tried so far:

The best workaround is to use -v $PWD:/src/app/spidy/config/ but it is still ugly

2 typos in spidy/spidy/docs/CONTRIBUTING.md

Error/Bug in code:

Expected Behavior

a project management platform that integrates with GitHub
If you make changes to crawler.py...

Actual Behavior

an project management platform hat integrates with GitHub
If you make changed to crawler.py...

Steps to Reproduce the Problem

change "an" to "a"
change "hat" to "that"
change "changed" to "changes"

What I've tried so far:

Simple spelling (typo) and grammar correction, also testing contribute process.

Specifications

Crawler Version: NA
Platform: NA
Python Version: NA
Dependency Versions: NA

Checklist

Same issue has not been opened before.

HEAD Request uses default Requests headers

The initial HEAD request sent to get the document's size is not using the set HTTP headers that the GET request is. Should be a simple fix.
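
The fix amounts to passing the same header dictionary to both requests. The sketch below uses the standard library's urllib.request so that it runs without dependencies or network access; with the requests library the equivalent call would be requests.head(url, headers=HEADERS). build_requests is a hypothetical helper, and HEADERS copies part of the header set shown in spidy's startup log.

```python
from urllib.request import Request

HEADERS = {
    "User-Agent": "spidy Web Crawler (Mozilla/5.0; bot; "
                  "+https://github.com/rivermont/spidy/)",
    "Accept-Language": "en_US, en-US, en",
}


def build_requests(url, headers):
    """Create the size-check HEAD request and the GET request with one header set."""
    head_req = Request(url, headers=headers, method="HEAD")
    get_req = Request(url, headers=headers, method="GET")
    return head_req, get_req


# Nothing is sent here; the objects are only inspected.
head_req, get_req = build_requests("http://example.com/", HEADERS)
print(head_req.get_method(), head_req.header_items() == get_req.header_items())
```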

Issue and PR Templates

It would be a good idea to have templates for these. I have created TEMPLATES.md as a start but would love some more input.

I looked here to get started.

Fails to crawl certain sites

Expected Behavior

crawl like hell

Actual Behavior

dies of an unknown error

Steps to Reproduce the Problem

echo "https://www.golem.de/"> ./crawler_todo.txt
spidy

What I've tried so far:

Using spidy

No error raised for incorrect input

Checklist

Same issue has not been opened before.

Expected Behavior

Raise an InputError and then stop the crawler.

Actual Behavior

No output, just exits straight to console.
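
A sketch of what the expected behavior could look like for the y/n prompts, assuming a blank answer keeps the default as the prompts state. InputError and parse_yes_no are hypothetical names used only for this example.

```python
class InputError(Exception):
    """Raised when the user's answer cannot be interpreted."""


def parse_yes_no(answer, default):
    """Turn a y/n answer into a bool; a blank answer keeps the default."""
    cleaned = answer.strip().lower()
    if cleaned == "":
        return default
    if cleaned in ("y", "yes"):
        return True
    if cleaned in ("n", "no"):
        return False
    raise InputError(f"Expected y or n, got {answer!r}")


try:
    parse_yes_no("maybe", default=False)
    error_message = None
except InputError as exc:
    # A real crawler would log this message and then stop.
    error_message = str(exc)
print(error_message)
```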

Steps to Reproduce the Problem

Run crawler
Choose 'No' to config file.
Type incorrect type of input

Specifications

Crawler Version: 1.6.2
Platform: Ubuntu (16.04 LTS)
Python Version: 3.5.2
Dependency Versions: Latest

Tests not working properly

From tests.py.

test_make_words_given_string
- Fails with AttributeError: 'str' object has no attribute 'text'
- make_words needs to be passed a requests.models.Response object in order to extract the text properly; however, I'm not sure there is a way to simulate that. Something like StringIO for files...?
- I feel like getting a page and passing it would take too long to be an acceptable test, but I could be wrong. Would take ~5 seconds.
test_mime_lookup_given_unknown_type
- Fails with crawler.HeaderError: Unknown MIME type: this_mime_doesn't_exist
- It should be caught with the assertRaises statement.
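
Both failures can probably be handled without network access. In the sketch below, a types.SimpleNamespace with a text attribute stands in for the Response object, and assertRaises is used as a context manager so the call happens where the exception can be caught. A frequent cause of the second symptom is writing assertRaises(Error, func(arg)), which calls the function too early, though that is only a guess about tests.py. The make_words, mime_lookup and HeaderError definitions are simplified stand-ins, not spidy's real code.

```python
import unittest
from types import SimpleNamespace


class HeaderError(Exception):
    """Stand-in for crawler.HeaderError."""


def make_words(page):
    """Stand-in for spidy's make_words: split the response text into words."""
    return page.text.split()


def mime_lookup(mime_type):
    """Stand-in for spidy's mime_lookup with a tiny lookup table."""
    known = {"text/html": ".html"}
    if mime_type not in known:
        raise HeaderError(f"Unknown MIME type: {mime_type}")
    return known[mime_type]


class CrawlerTests(unittest.TestCase):
    def test_make_words_given_string(self):
        # Only the .text attribute is needed, so no real Response is built.
        fake_response = SimpleNamespace(text="hello crawler world")
        self.assertEqual(make_words(fake_response),
                         ["hello", "crawler", "world"])

    def test_mime_lookup_given_unknown_type(self):
        # The call sits inside the with-block so assertRaises can catch it.
        with self.assertRaises(HeaderError):
            mime_lookup("this_mime_doesn't_exist")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CrawlerTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```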

Tests for multithreading

Feature Description

It would be great to have a check in tests.py for the multithreading and queue.

Checklist

This feature does not already exist.
This feature has not been requested before.

Web Crawler GUI!

Having a clicky interface has been a goal for a long time now. There are many users who abhor the command line but are still interested in the tools that use it.

The remnants of a TkInter interface can be found in gui.py.
Some thoughts can be found in the docs here, as well as a wireframe sketch.

Create Windows Executable

Most likely using something like py2exe.

Fail crawling relative url and protocol

The crawler appends the child's URI to the full URL of the parent page instead of resolving it against the site root:
https://mysite/folder/page
=> found : /js/main.js
https://mysite/folder/page/js/main.js

Same thing when a link doesn't have protocol declared :
https://mysite/folder/page
=> found : //subdomain.mysite/images/myimage.png
https://mysite/folder/page//subdomain.mysite/images/myimage.png
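
For comparison, the standard library's urllib.parse.urljoin resolves both kinds of link the way a browser would. This is a sketch of the expected results, not a patch to the crawler; resolve_link is a hypothetical wrapper.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


page = "https://mysite/folder/page"
root_relative = resolve_link(page, "/js/main.js")
protocol_relative = resolve_link(page, "//subdomain.mysite/images/myimage.png")
print(root_relative)      # https://mysite/js/main.js
print(protocol_relative)  # https://subdomain.mysite/images/myimage.png
```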

Install
apt-get install python3 python3-lxml python3-requests
apt-get install python3-pip python-pip
pip3 install spidy-web-crawler

Starting spidy Web Crawler version 1.6.5

Am I the only one with this problem?

Thanks for your help

Tests

I have no experience with writing Python tests, but it seems to be necessary when a program gets big.

Since the small parts of the crawler are already broken up into separate functions, one approach might be to test that each function runs without error and returns the expected type.

Platform Support

We would love to confirm that Spidy will run on all systems, and fix any bugs that may be hidden!

Install spidy either from source, through PyPI, or a GitHub Release (instructions found in the README). Run the crawler, using any config file (preferably rivermont-infinite or heavy to test all features) or a custom configuration.
Report any bugs by opening a new Issue here.
Comment your Platform specs, Python version, spidy version, etc. No information is too much.

String Index Error on perfectly normal URLs

Checklist

Same issue has not been opened before.

Expected Behavior

No errors.

Actual Behavior

Seemingly randomly, crawling a url will fail with a

string index out of range

error. There doesn't seem to be anything wrong with the URLs:

http://www.denverpost.com/breakingnews/ci_21119904
https://www.publicintegrity.org/2014/07/15/15037/decades-making-decline-irs-nonprofit-regulation
https://cdn.knightlab.com/libs/timeline3/latest/js/timeline-min.js
https://github.com/rivermont/spidy/
https://twitter.com/adamwhitcroft

Steps to Reproduce the Problem

Run the crawler.
Wait a few seconds.

What I've tried so far

Raising the error gave the traceback:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "crawler.py", line 260, in crawl_worker
    if link[0] == '/':
IndexError: string index out of range
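
The traceback indicates that an empty string reached the link[0] check, since indexing an empty string always raises IndexError. A minimal sketch of a check that tolerates empty links, using str.startswith; is_root_relative is a hypothetical helper name.

```python
def is_root_relative(link):
    """True for links such as '/about'; False for empty strings."""
    # startswith never raises on an empty string, unlike link[0].
    return link.startswith("/")


links = ["/about", "", "https://github.com/rivermont/spidy/"]
flags = [is_root_relative(link) for link in links]
print(flags)  # [True, False, False]
```

Where the empty links come from (for example an empty href attribute) would still need to be tracked down separately.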

Specifications

Crawler Version: 1.6.0
Platform: Linux (Ubuntu 16.04 LTS)
Dependency Versions: All latest

Respect robots.txt

There should be an option (disable-able) to ignore links that are forbidden by a site's robots.txt. Another library or huge regex might need to be used to parse out the domain the page is on.
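
Extracting the domain probably does not need a new library or a large regex: the standard library's urllib.parse.urlsplit already separates scheme and host. A short sketch, with robots_url_for as a hypothetical helper:

```python
from urllib.parse import urlsplit


def robots_url_for(page_url):
    """Return the robots.txt location for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


robots_url = robots_url_for(
    "http://www.frankshospitalworkshop.com/equipment.html")
print(robots_url)  # http://www.frankshospitalworkshop.com/robots.txt
```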
