
spidy's People

Contributors

3onyc, awesomemariofan, dekan, dstjacques, enzosk8, esouthren, hrily, iamvibhorsingh, j-setiawan, kylesalk, lukavia, mdaizovi, michellemorales, nmullane, quatroka, rivermont, stevelle

spidy's Issues

PyPI Package

It would be great to have spidy on the Package Index (the new one); installing would take only one command. I have tried to get it going, and my efforts can be found on the pypi-dev branch.

There are some problems with imports, file saving, tests, etc.

While I would like to be the owner/maintainer of the package, some help getting it started would be greatly appreciated.

Save robots.txt results

Currently, a request is sent for a site's robots.txt every time a link is crawled. It would be much faster if results of a robots.txt query were saved in some database. Only one request should need to be sent.
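
A minimal sketch of the caching idea, assuming an in-memory dict stands in for the "database" and that the caller supplies a `fetch_text(url)` function (a hypothetical hook, not part of spidy). It uses the standard library's `urllib.robotparser` rather than whatever parser the crawler actually uses:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Cache one parsed robots.txt per host so it is fetched only once."""

    def __init__(self, fetch_text):
        # fetch_text(url) -> str is a hypothetical hook supplied by the caller,
        # e.g. a thin wrapper around the crawler's HTTP client.
        self._fetch_text = fetch_text
        self._parsers = {}

    def allowed(self, url, user_agent="*"):
        parts = urlsplit(url)
        host_key = (parts.scheme, parts.netloc)
        parser = self._parsers.get(host_key)
        if parser is None:
            # First link seen for this host: fetch and parse robots.txt once.
            robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
            parser = RobotFileParser()
            parser.parse(self._fetch_text(robots_url).splitlines())
            self._parsers[host_key] = parser
        return parser.can_fetch(user_agent, url)
```

With several worker threads, the dict would also need a lock around the lookup-and-fetch step.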

Multiple HTTP Threads

Crawling would go much faster connecting to multiple pages at once.

Possible Problems:

  • Crawling same page twice
  • Corruption of save files if reading/writing at different places at the same time.

These should be solvable using mutexes.
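
As a rough illustration of the mutex approach, the sketch below guards a shared "seen" set and the save step with one `threading.Lock`. The class and method names are made up for this example and do not come from crawler.py:

```python
import threading


class CrawlState:
    """Shared crawl bookkeeping guarded by a single mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def claim(self, url):
        """Return True only for the first thread that asks for this URL."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

    def save(self, path):
        """Write the seen set while holding the lock, so no other thread
        can modify it or start a second write in the meantime."""
        with self._lock:
            with open(path, "w", encoding="utf-8") as handle:
                handle.write("\n".join(sorted(self._seen)))
```

A worker would call `claim(url)` before requesting a page and skip the URL when it returns False.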

Command line arguments

Arguments for:

  • Overwrite existing save files
  • Raise Errors (possibly only for different severity levels?)
  • Save pages
  • Save words
  • Zip files
  • Override file size check.
  • Domain restriction
  • Respect robots.txt
  • Custom save file locations
  • Autosave count
  • HTTP headers
  • Max errors
  • Starting page

NOTES: Don't try to use sys.argv.
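
One way to avoid hand-parsing sys.argv is the standard `argparse` module. The flag names below are illustrative guesses covering part of the list above, not an agreed interface; the defaults of 100 and 5 mirror the defaults shown in the crawler's interactive prompts:

```python
import argparse


def build_parser():
    """Build a parser for a subset of the options listed in this issue."""
    parser = argparse.ArgumentParser(description="spidy-style crawler options (illustrative)")
    parser.add_argument("--overwrite", action="store_true", help="overwrite existing save files")
    parser.add_argument("--no-save-pages", dest="save_pages", action="store_false",
                        help="do not save scraped pages")
    parser.add_argument("--zip", dest="zip_files", action="store_true", help="zip saved documents")
    parser.add_argument("--domain", default=None, help="restrict crawling to this domain")
    parser.add_argument("--ignore-robots", dest="respect_robots", action="store_false",
                        help="do not respect robots.txt")
    parser.add_argument("--save-count", type=int, default=100, help="links between autosaves")
    parser.add_argument("--max-errors", type=int, default=5, help="new errors before stopping")
    parser.add_argument("--start", default=None, help="starting page")
    return parser
```

`build_parser().parse_args()` reads the process arguments on its own, and passing an explicit list such as `parse_args(["--domain", "example.com"])` makes the parser easy to test.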

Autosave triggered by single thread and not global.

Checklist

  • Same issue has not been opened before.

Expected Behavior

All threads should stop while the crawler prints info and saves files.

Actual Behavior

Once one thread reaches SAVE_COUNT links crawled, it saves while the other threads continue. This results in [CRAWL] logs in between [INFO] logs.

It seems like this is inefficient and could result in some saving errors.
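
A sketch of a global counter, assuming every worker reports each crawled link to one shared object. `AutosaveCounter` and `record_crawl` are hypothetical names; only the SAVE_COUNT idea comes from the report above:

```python
import threading


class AutosaveCounter:
    """Count crawled links across all threads instead of per thread."""

    def __init__(self, save_count):
        self._save_count = save_count
        self._lock = threading.Lock()
        self._crawled = 0

    def record_crawl(self):
        """Count one crawled link; return True when a global autosave is due."""
        with self._lock:
            self._crawled += 1
            if self._crawled >= self._save_count:
                self._crawled = 0
                return True
            return False
```

This only decides when to save. Pausing the other workers during the save would need an extra signal, for example a `threading.Event` that each worker checks between links.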

Steps to Reproduce the Problem

  1. Run crawler
  2. Wait for the autosave cap to be hit.

Specifications

  • Crawler Version: 1.6.2
  • Platform: Ubuntu (16.04 LTS)
  • Python version: 3.5.2
  • Dependency Versions: All latest.

PyPI Description Formatting

After multiple tries I have yet to get PyPI to format the README correctly. Current state can be viewed here.

At the moment, following this SO answer, I convert README.md to reStructuredText when running setup.py. However, it looks like you cannot have relative links on a PyPI page, so I changed them all to https links to the GitHub files.

It's still displaying the RST as text and not rendering it, but I don't know where to go from here.
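
A possible alternative, assuming a setuptools and PyPI combination recent enough to accept Markdown descriptions directly: declare the content type instead of converting to RST. The helper below only builds the metadata dict; the package name is taken from the `pip3 install` command quoted elsewhere on this page, and the rest is a sketch:

```python
def build_metadata(readme_text):
    """Return keyword arguments that could be passed to setuptools.setup()."""
    return {
        "name": "spidy-web-crawler",
        "long_description": readme_text,
        # Tells PyPI to render the description as Markdown rather than RST.
        "long_description_content_type": "text/markdown",
    }
```

In setup.py this would be used as `setup(**build_metadata(open("README.md", encoding="utf-8").read()), ...)`.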

Failed crawl for http://www.frankshospitalworkshop.com/

$ docker run --rm -it -v $PWD:/data spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Starting spidy Web Crawler version 1.6.5
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Report any problems to GitHub at https://github.com/rivermont/spidy
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating classes...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating functions...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Creating variables...
[01:01:33] [spidy] [WORKER #0] [INIT] [INFO]: Should spidy load settings from an available config file? (y/n):
n
[01:01:40] [spidy] [WORKER #0] [INIT] [INFO]: Please enter the following arguments. Leave blank to use the default values.
[01:01:40] [spidy] [WORKER #0] [INIT] [INPUT]: How many parallel threads should be used for crawler? (Default: 1):

[01:01:47] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy load from existing save files? (y/n) (Default: Yes):

[01:01:54] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy raise NEW errors and stop crawling? (y/n) (Default: No):

[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy save the pages it scrapes to the saved folder? (y/n) (Default: Yes):

[01:01:55] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy zip saved documents when autosaving? (y/n) (Default: No):

[01:01:57] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy download documents larger than 500 MB? (y/n) (Default: No):

[01:01:58] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy scrape words and save them? (y/n) (Default: Yes):

[01:01:59] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy restrict crawling to a specific domain only? (y/n) (Default: No):
y
[01:02:02] [spidy] [WORKER #0] [INIT] [INPUT]: What domain should crawling be limited to? Can be subdomains, http/https, etc.
http://www.frankshospitalworkshop.com/
[01:02:07] [spidy] [WORKER #0] [INIT] [INPUT]: Should spidy respect sites' robots.txt? (y/n) (Default: Yes):
y
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: What HTTP browser headers should spidy imitate?
[01:02:13] [spidy] [WORKER #0] [INIT] [INPUT]: Choices: spidy (default), Chrome, Firefox, IE, Edge, Custom:

[01:02:14] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the TODO save file (Default: crawler_todo.txt):
/data/crawler_todo.txt
[01:02:24] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the DONE save file (Default: crawler_done.txt):
/data/crawler_done.txt
[01:02:31] [spidy] [WORKER #0] [INIT] [INPUT]: Location of the words save file (Default: crawler_words.txt):
/data/crawler_words.txt
[01:02:38] [spidy] [WORKER #0] [INIT] [INPUT]: After how many queried links should the crawler autosave? (Default: 100):

[01:02:39] [spidy] [WORKER #0] [INIT] [INPUT]: After how many new errors should spidy stop? (Default: 5):

[01:02:40] [spidy] [WORKER #0] [INIT] [INPUT]: After how many known errors should spidy stop? (Default: 10):

[01:02:41] [spidy] [WORKER #0] [INIT] [INPUT]: After how many HTTP errors should spidy stop? (Default: 20):

[01:02:42] [spidy] [WORKER #0] [INIT] [INPUT]: After encountering how many new MIME types should spidy stop? (Default: 20):

[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Loading save files...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Successfully started spidy Web Crawler version 1.6.5...
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Using headers: {'User-Agent': 'spidy Web Crawler (Mozilla/5.0; bot; +https://github.com/rivermont/spidy/)', 'Accept-Language': 'en_US, en-US, en', 'Accept-Encoding': 'gzip', 'Connection': 'keep-alive'}
[01:02:43] [spidy] [WORKER #0] [INIT] [INFO]: Spawning 1 worker threads...
[01:02:43] [spidy] [WORKER #1] [INIT] [INFO]: Starting crawl...
[01:02:43] [reppy] [WORKER #0] [ROBOTS] [INFO]: Reading robots.txt file at: http://www.frankshospitalworkshop.com/robots.txt
[01:02:45] [spidy] [WORKER #1] [CRAWL] [ERROR]: An error was raised trying to process http://www.frankshospitalworkshop.com/equipment.html
[01:02:45] [spidy] [WORKER #1] [ERROR] [INFO]: An XMLSyntaxError occurred. A web dev screwed up somewhere.
[01:02:45] [spidy] [WORKER #1] [LOG] [INFO]: Saved error message and timestamp to error log file
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: Stopping all threads...
[01:02:45] [spidy] [WORKER #0] [CRAWL] [INFO]: I think you've managed to download the entire internet. I guess you'll want to save your files...
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved TODO list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved DONE list to /data/crawler_todo.txt
[01:02:45] [spidy] [WORKER #0] [SAVE] [INFO]: Saved 0 words to /data/crawler_words.txt

Feature and Bug Reports

Calling any passers-by to take a moment to submit feature requests or bugs, no matter how small!

Please see the README for a general overview of this project, docs.md for some outdated documentation, and CONTRIBUTING.md for some more words.

Linux version of crawler

So far spidy has been developed on Windows for Windows, but obviously that alone won't work on other platforms.

After doing some testing, it seems that on Linux (Ubuntu 16.04, at least) the crawler interprets folders such as config/ with the slash as part of the folder name, whereas on Windows the slash is needed to indicate being a folder.
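
One portable option, sketched here with a hypothetical helper, is to build paths with `pathlib` so that no separator is ever part of a folder name:

```python
from pathlib import Path


def config_path(base_dir, name):
    """Build <base_dir>/config/<name> without hard-coding a separator."""
    return Path(base_dir) / "config" / name
```

`Path` picks the right separator for the platform it runs on, so the same call works on Windows and Linux.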

Docker Container

I've been looking for a decent command-line web crawler for some time and came across this project. It seems very promising.

I have been working on getting this project working in a docker container.

Would you be interested in my contributing the dockerfile back to this project for anyone else who might be interested in the same?

unusable

hi, i tried to use spidy b.c. it looked promising.
Is it dead?

first:
sudo pip install -r requirements.txt
doesn't work, reppy is not installable (python 3.9)

snd:
Docker is a pita...
Please look into ConfigArgParse if you need config files BUT make sure that arguments can be used as well
with docker, there is no error log...
I ended with
docker run --rm -it -v $PWD:/data -w /data --entrypoint /src/app/spidy/crawler.py spidy
so that the error log is accessible (why is there no config option?!)

why is a suffix on the config file enforced? What is that? Windows?

thrd:
my config contained either an Ip or a hostname (resolved via /etc/hosts)
Spidy did not spider either.
For the hostname option it gave

ERROR: OSError
EXT: HTTPConnectionPool(host='example.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4ae176ecc0>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Seems that it doesn't respect /etc/hosts?!
But neither did the ip option work...
e.g. '192.168.1.55/wiki/'

Cookie handling

Feature Description

Some cookie handling functionality would be pretty valuable. Setting cookies in the config file should be trivial to implement. An option to send a GET or POST every n scraped URLs / minutes, and apply the received cookies to subsequent requests, would be great too. I might do a PR if I have the time in the near future.
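
For the config-file part, a small sketch assuming cookies are stored as a single `name=value; other=value` string (the setting format is an assumption, not an existing spidy option):

```python
from http.cookies import SimpleCookie


def parse_cookie_setting(raw):
    """Turn 'name=value; other=value' into a plain dict."""
    jar = SimpleCookie()
    jar.load(raw)
    return {name: morsel.value for name, morsel in jar.items()}
```

The resulting dict could then be handed to the HTTP client, for example through the `cookies=` argument that `requests.get` accepts.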

Checklist

  • This feature does not already exist.
  • This feature has not been requested before.

Crawler saving bad links as jumbled mess - sometimes

If you run the crawler and take a look at crawler_bad (or whatever the bad links file is for your configuration), it will have some links in byte form - another problem - but most lines will be single characters. Must fix.

Documentation Needed

CONTRIBUTING.md has some guidelines, but essentially there is simply a lot of stuff that needs to be filled out in the docs.

Also, if you would like to use another documentation format feel free. Listing everything is something I came up with in early development but it's probably not scalable.

Docker is unusable

Expected Behavior

Docker should simplify things not make them harder

Actual Behavior

Docker is a struggle; you have to build the image several times before it works:
It ignores configs that are in the /data directory and uses only the defaults that were in the repo.
It creates results as root, etc.

What I've tried so far:

The best workaround is to use -v $PWD:/src/app/spidy/config/ but it is still ugly.

2 typos in spidy/spidy/docs/CONTRIBUTING.md

Error/Bug in code:

Expected Behavior

a project management platform that integrates with GitHub
If you make changes to crawler.py...

Actual Behavior

an project management platform hat integrates with GitHub
If you make changed to crawler.py...

Steps to Reproduce the Problem

  1. change "an" to "a"
  2. change "hat" to "that"
  3. change "changed" to "changes"

What I've tried so far:

Simple spelling (typo) and grammar correction, also testing contribute process.

Specifications

  • Crawler Version: NA
  • Platform: NA
  • Python Version: NA
  • Dependency Versions: NA

Checklist

  • Same issue has not been opened before.

No error raised for incorrect input

Checklist

  • Same issue has not been opened before.

Expected Behavior

Raise an InputError and then stop the crawler.

Actual Behavior

No output, just exits straight to console.

Steps to Reproduce the Problem

  1. Run crawler
  2. Choose 'No' to config file.
  3. Type incorrect type of input
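
A sketch of the expected behaviour for the y/n prompts. `InputError` is the name used in this report; `parse_yes_no` is a hypothetical helper and not a function from crawler.py:

```python
class InputError(Exception):
    """Raised when a prompt answer cannot be interpreted."""


def parse_yes_no(answer, default):
    """Map a y/n prompt answer to a bool, raising on anything unrecognised."""
    cleaned = answer.strip().lower()
    if cleaned == "":
        # A blank answer means "use the default", as the prompts describe.
        return default
    if cleaned in ("y", "yes"):
        return True
    if cleaned in ("n", "no"):
        return False
    raise InputError(f"Expected y/n, got {answer!r}")
```

The caller can catch `InputError`, log it, and stop the crawler instead of exiting silently.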

Specifications

  • Crawler Version: 1.6.2
  • Platform: Ubuntu (16.04 LTS)
  • Python Version: 3.5.2
  • Dependency Versions: Latest

Tests not working properly

From tests.py.

  • test_make_words_given_string
    • Fails with AttributeError: 'str' object has no attribute 'text'
    • make_words needs to be passed a requests.models.Response object in order to extract the text properly, however I'm not sure there is a way to simulate that. Something like StringIO for files...?
    • I feel like getting a page and passing it would take too long to be an acceptable test, but I could be wrong. Would take ~5 seconds.
  • test_mime_lookup_given_unknown_type
    • Fails with crawler.HeaderError: Unknown MIME type: this_mime_doesn't_exist
    • It should be caught with the assertRaises statement.
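
Both failures can be approached without any network access. The sketch below uses stand-in versions of `make_words`, `mime_lookup` and `HeaderError` (the real ones live in crawler.py and may differ): a `SimpleNamespace` with a `.text` attribute plays the role of the Response object, and the context-manager form of `assertRaises` ensures the exception is raised inside the assertion. One common cause of the second failure, offered here only as a guess, is calling the function inline, as in `assertRaises(HeaderError, mime_lookup("x"))`, which raises before assertRaises can catch it.

```python
import unittest
from types import SimpleNamespace


class HeaderError(Exception):
    """Stand-in for crawler.HeaderError."""


def make_words(page):
    """Stand-in for the crawler function: split the response text into words."""
    return page.text.split()


def mime_lookup(mime_type):
    """Stand-in lookup that rejects unknown MIME types."""
    known = {"text/html": ".html"}
    if mime_type not in known:
        raise HeaderError(f"Unknown MIME type: {mime_type}")
    return known[mime_type]


class CrawlerTests(unittest.TestCase):
    def test_make_words_given_fake_response(self):
        # Any object with a .text attribute is enough for this function.
        fake_page = SimpleNamespace(text="hello crawler world")
        self.assertEqual(make_words(fake_page), ["hello", "crawler", "world"])

    def test_mime_lookup_given_unknown_type(self):
        # The call happens inside the with-block, so the error is caught.
        with self.assertRaises(HeaderError):
            mime_lookup("this_mime_doesn't_exist")
```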

Tests for multithreading

Feature Description

It would be great to have a check in tests.py for the multithreading and queue.

Checklist

  • This feature does not already exist.
  • This feature has not been requested before.

Web Crawler GUI!

Having a clicky interface has been a goal for a long time now. There are many users who abhor the command line but are still interested in the tools that use it.

  • The remnants of a TkInter interface can be found in gui.py.
  • Some thoughts can be found in the docs here, as well as a wireframe sketch.

Fail crawling relative url and protocol

The crawler concatenates the child's URI onto the parent's full URL instead of resolving it relative to the parent:
https://mysite/folder/page
=> found : /js/main.js
https://mysite/folder/page/js/main.js

Same thing when a link doesn't have protocol declared :
https://mysite/folder/page
=> found : //subdomain.mysite/images/myimage.png
https://mysite/folder/page//subdomain.mysite/images/myimage.png
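
For comparison, the standard library's `urljoin` resolves both cases the way a browser would. This is a sketch of the expected results, not the crawler's current code; `resolve_link` is a hypothetical wrapper:

```python
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Resolve a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Root-relative link: replaces the whole path of the parent page.
print(resolve_link("https://mysite/folder/page", "/js/main.js"))
# -> https://mysite/js/main.js

# Protocol-relative link: keeps only the parent's scheme.
print(resolve_link("https://mysite/folder/page", "//subdomain.mysite/images/myimage.png"))
# -> https://subdomain.mysite/images/myimage.png
```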

Install
apt-get install python3 python3-lxml python3-requests
apt-get install python3-pip python-pip
pip3 install spidy-web-crawler

Starting spidy Web Crawler version 1.6.5

Am I the only one with this problem?

Thanks for your help.

Tests

I have no experience with writing Python tests, but it seems to be necessary when a program gets big.

Since the small parts of the crawler are already broken up into separate functions, one approach might be to test that each function runs without error and returns the expected type.

Platform Support

We would love to confirm that Spidy will run on all systems, and fix any bugs that may be hidden!

  1. Install spidy either from source, through PyPI, or a GitHub Release (instructions found in the README). Run the crawler, using any config file (preferably rivermont-infinite or heavy to test all features) or a custom configuration.
  2. Report any bugs by opening a new Issue here.
  3. Comment your Platform specs, Python version, spidy version, etc. No information is too much.

String Index Error on perfectly normal URLs

Checklist

  • Same issue has not been opened before.

Expected Behavior

No errors.

Actual Behavior

Seemingly randomly, crawling a url will fail with a

string index out of range

error. There doesn't seem to be anything wrong with the URLs:

http://www.denverpost.com/breakingnews/ci_21119904
https://www.publicintegrity.org/2014/07/15/15037/decades-making-decline-irs-nonprofit-regulation
https://cdn.knightlab.com/libs/timeline3/latest/js/timeline-min.js
https://github.com/rivermont/spidy/
https://twitter.com/adamwhitcroft

Steps to Reproduce the Problem

  1. Run the crawler.
  2. Wait a few seconds.

What I've tried so far

Raising the error gave the traceback:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "crawler.py", line 260, in crawl_worker
    if link[0] == '/':
IndexError: string index out of range
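
The traceback points at `link[0]`, which raises IndexError whenever `link` is an empty string (for instance an empty href found on an otherwise normal page; that cause is an assumption). A sketch of a check that tolerates empty strings, using a hypothetical helper name:

```python
def is_root_relative(link):
    """True for links such as '/about'; safe on empty strings."""
    return link.startswith("/")
```

`str.startswith` simply returns False for an empty string instead of raising.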

Specifications

  • Crawler Version: 1.6.0
  • Platform: Linux (Ubuntu 16.04 LTS)
  • Dependency Versions: All latest

Respect robots.txt

There should be an option (disable-able) to ignore links that are forbidden by a site's robots.txt. Another library or huge regex might need to be used to parse out the domain the page is on.
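
Parsing out the domain does not need a regex: the standard library's `urlsplit` already separates scheme and host. A minimal sketch, where `robots_url` is a hypothetical helper:

```python
from urllib.parse import urlsplit


def robots_url(page_url):
    """Return the robots.txt location for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"
```

For example, `robots_url("https://github.com/rivermont/spidy/")` gives `https://github.com/robots.txt`.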
