
spidr's People

Contributors

buren, davidsauntson, joshcheek, justfalter, kyaroch, maccman, nu7hatch, petergoldstein, postmodern, rfletcher, spk, tricknotes, zapnap


spidr's Issues

Multithreading

Hi there,

I was wondering if it would be possible to multithread the spidr gem? I don't know much about multithreading in Ruby, but I believe only Ruby 1.9.x is able to do so?

I had a look through the source but couldn't find where the spidr gem makes its HTTP requests.

Maybe something like Typhoeus could be used? (http://rubygems.org/gems/typhoeus)
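For what it's worth, a minimal sketch of one workaround, assuming you only need concurrency across sites: spidr itself crawls sequentially (the HTTP calls go through Net::HTTP inside Agent#get_page, per the stack traces elsewhere on this page), but independent agents can be run in ordinary Ruby threads. The site list here is hypothetical.

require 'spidr'

# Hypothetical list of sites; one independent agent per thread.
sites = %w[http://example.com http://example.org]

threads = sites.map do |site|
  Thread.new do
    Spidr.site(site) do |spider|
      spider.every_url { |url| puts "#{Thread.current.object_id}: #{url}" }
    end
  end
end

threads.each(&:join)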

Thanks,
Ryan

catching SSLErrors

First off, thanks for the great work on spidr! Second, I noticed that spidr kinda dies on sites that completely fail at SSL. For instance, https://36pizza.com causes spidr to crash with an SSL error. This can be resolved by adding OpenSSL::SSL::SSLError to the rescue clause on lib/agent.rb:684. Right now I don't have time to fork and make a pull request, so I'm monkeypatching, but I thought you should know.
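A hedged sketch of such a monkeypatch, wrapping Agent#prepare_request from the outside rather than editing spidr's rescue clause directly (Agent#failed is assumed to exist for recording the failure, hence the guard):

require 'spidr'
require 'openssl'

module Spidr
  class Agent
    alias_method :prepare_request_without_ssl_rescue, :prepare_request

    # Treat an SSL failure like any other failed request instead of crashing.
    def prepare_request(url, &block)
      prepare_request_without_ssl_rescue(url, &block)
    rescue OpenSSL::SSL::SSLError
      failed(url) if respond_to?(:failed, true)  # assumed helper
      nil
    end
  end
end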

Crawling a specific page

Hey! I was wondering if there was a way to return all links found on a specific page. So far spidr has been great for crawling a whole site, but for my testing I'd like to be able to focus on one page.
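A hedged sketch of one way to do this, assuming Agent#get_page and the page URL enumerators (each_url / urls, seen in the stack traces elsewhere on this page) behave as expected; the page URL below is a placeholder:

require 'spidr'

# Fetch a single page and list the links it contains, without crawling further.
agent = Spidr::Agent.new
page  = agent.get_page(URI('http://example.com/some-page'))

page.each_url { |url| puts url } if page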

Thanks

Following redirects

Howdy! Just wondering if I'm implementing this right. I need to follow redirects, and there doesn't seem to be an option to toggle this, so I tried implementing it this way. It seems to work, but I would like some feedback!

Spidr.site(@url, max_depth: 2, limit: 20) do |spider|
  spider.every_redirect_page do |page|
    # allow the redirect target's host, then queue the target URL itself
    spider.visit_hosts << URI.parse(page.location).host
    spider.enqueue page.location
  end
end

Problem building gem

When trying to build the gem for the edge version of this library I get the following message:

dwilliams@lists:~/src/spidr$ rake gem
(in /home/dwilliams/src/spidr)
rake aborted!
undefined method `yard_opts' for #<Hoe:0xb78fb5e4>
/home/dwilliams/src/spidr/Rakefile:14
(See full trace by running task with --trace)
dwilliams@lists:~/src/spidr$ rake gem
(in /home/dwilliams/src/spidr)
rake aborted!
undefined method `yard_opts' for #<Hoe:0xb78c0430>
/home/dwilliams/src/spidr/Rakefile:14
(See full trace by running task with --trace)

and the trace is this:

dwilliams@lists:~/src/spidr$ rake --trace gem
(in /home/dwilliams/src/spidr)
rake aborted!
undefined method `yard_opts' for #<Hoe:0xb79330ac>
/home/dwilliams/src/spidr/Rakefile:14
/usr/lib/ruby/gems/1.8/gems/hoe-2.5.0/lib/hoe.rb:292:in `instance_eval'
/usr/lib/ruby/gems/1.8/gems/hoe-2.5.0/lib/hoe.rb:292:in `spec'
/home/dwilliams/src/spidr/Rakefile:9
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2383:in `load'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2383:in `raw_load_rakefile'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2017:in `load_rakefile'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2068:in `standard_exception_handling'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2016:in `load_rakefile'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2000:in `run'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2068:in `standard_exception_handling'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:1998:in `run'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/bin/rake:31
/usr/bin/rake:19:in `load'
/usr/bin/rake:19

What obvious thing am I missing this time? I have noticed that I get this issue with a few of your other libraries as well.

expected absolute path component: sites/ftp.apache.org/

Get this error while spidering http://apache.org.

The URL that breaks the spider seems to be this one:
www.mirrorservice.org/sites/ftp.apache.org/ (it looks as though there's a domain in there, when in fact it's a path)

/usr/lib/ruby/1.8/uri/generic.rb:475:in `check_path': bad component(expected absolute path component): sites/ftp.apache.org/ (URI::InvalidComponentError)
    from /usr/lib/ruby/1.8/uri/generic.rb:495:in `path='
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:537:in `to_absolute'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `urls'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `map'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `urls'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:587:in `visit_page'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:513:in `get_page'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:678:in `prepare_request'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:507:in `get_page'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:573:in `visit_page'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:244:in `run'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:226:in `start_at'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:197:in `site'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:124:in `initialize'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:194:in `new'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:194:in `site'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/spidr.rb:96:in `site'

I will do some further investigating and see if I can come up with a fix.

ryan

URL Normalization.

Hey postmodern,

It seems like the usage of File.expand_path in Spidr::Page#to_absolute can goof up URLs in a very minor way. Observe:

irb(main):001:0> a = '/somedir/'
=> "/somedir/"
irb(main):002:0> File.expand_path(a)
=> "/somedir"

Imagine that you go to a site 'http://www.foo.com/somedir' ... '/somedir' is a directory, and the server responds with:

HTTP/1.1 301 Moved Permanently
Date: Mon, 21 Sep 2009 23:08:19 GMT
Server: Apache/2.0.63 (CentOS)
Location: http://www.foo.com/somedir/
.....

Requesting 'http://www.foo.com/somedir/' yields
HTTP/1.1 200 OK
Date: Mon, 21 Sep 2009 23:10:32 GMT
Server: Apache/2.0.63 (CentOS)
.....

When to_absolute normalizes 'http://www.foo.com/somedir/', it comes out of the method as 'http://www.foo.com/somedir', which has already been visited.

In the real world, 'http://www.foo.com/somedir' != 'http://www.foo.com/somedir/'. File.expand_path doesn't know the difference between the two, but to an HTTP server they are two different things.
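A minimal sketch of a trailing-slash-preserving normalization, assuming the goal is only to keep the distinction that File.expand_path throws away (this is not spidr's actual code):

# Normalize a URL path but keep a trailing slash if the input had one.
def normalize_path(path)
  normalized = File.expand_path(path)
  normalized << '/' if path.end_with?('/') && !normalized.end_with?('/')
  normalized
end

normalize_path('/somedir/')  # => "/somedir/"
normalize_path('/somedir')   # => "/somedir"
normalize_path('/a/../b/')   # => "/b/"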

~Mike

Spidr behaves oddly on bad host-names under Ruby 1.8.7-p249

Just noticed unusual exceptions coming from Net::HTTP when running Spidr on the WSOC under Ruby 1.8.7-p249.

NoMethodError in 'Spidr::Agent before(:all)'
undefined method `closed?' for nil:NilClass
/usr/lib64/ruby/1.8/net/http.rb:1060:in `request'
/usr/lib64/ruby/1.8/net/http.rb:772:in `get'
/underground/code/spidr/spec/helpers/wsoc.rb:69:in `run_course'
./spec/agent_spec.rb:10:

Link depth?

Hi,

I am sure I saw somewhere that 'start_at' accepted a 'depth' parameter, but I can't seem to find any reference to it any longer. Either way, if I wasn't just imagining things, the depth option doesn't seem to be working. I grepped through the code and couldn't find it either.

Here is a pastie of a test spider of my blog:
http://pastie.org/1754398

Is there a depth option? If so, am I using it wrong, or is it not working?
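For what it's worth, a hedged sketch: newer spidr versions accept a :max_depth option (it is used that way elsewhere on this page); older releases may simply not have it, which would explain the grep coming up empty.

Spidr.start_at('http://example.com/', max_depth: 2) do |spider|
  spider.every_url { |url| puts url }
end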

Thanks a million for any help. :)

Page#to_absolute raises URI::InvalidURIError: path conflicts with opaque

Thanks for an awesome gem!

When crawling a site, this exception was raised:

URI::InvalidURIError: path conflicts with opaque
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:761:in `check_path'
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:817:in `path='
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:283:in `to_absolute'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:182:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `upto'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `each'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:238:in `each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:605:in `block in get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:599:in `get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'

The line in spidr that raises this error is lib/spidr/page/html.rb:283:

new_url.path = URI.expand_path(path)

URI#path= calls URI#check_path, which raises the error; see the Ruby docs for URI::Generic#check_path.

I'm not really sure what the best way to handle this would be. Perhaps catching URI::InvalidURIError and returning nil is sensible, since nil can already be returned from Page#to_absolute?
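A hedged sketch of that suggestion, written as a standalone helper rather than an actual spidr patch (safe_absolute is a hypothetical name; URI.expand_path is the spidr extension referenced above):

require 'spidr'

# Resolve a link against a base URI, guarding the path normalization and
# falling back to nil, which callers of Page#to_absolute already tolerate.
def safe_absolute(base, link)
  new_url = base.merge(link.to_s)

  if (path = new_url.path) && !path.empty?
    new_url.path = URI.expand_path(path)
  end

  new_url
rescue URI::InvalidURIError, URI::InvalidComponentError
  nil
end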

Skip processing of pages

The documentation says that it is possible to skip processing some pages, but I cannot find how to do it. I have tried ignore_links and ignore_pages, but nothing seems to work, e.g.:

spider = Spidr.site('.....', ignore_links: [%{^/blog/}]) do |spider|
  spider.every_html_page do |page|
    # here I still get pages with the /blog URL
  end
end

How can I ignore some pages based on the URL?
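A hedged sketch of what usually works: pass a real Regexp rather than a %{} literal (which is just a String), and avoid anchoring it to the start of the string in case the pattern is matched against the full URL rather than the path. The site URL below is a stand-in.

Spidr.site('http://example.com/', ignore_links: [%r{/blog/}]) do |spider|
  spider.every_html_page do |page|
    puts page.url   # /blog/... pages should no longer show up here
  end
end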

`ignore_links` not working.

Hello. I am loving this library! But I have an issue.

I am collecting the URLs of already-scraped pages in an array, and for later continuation of the process I am using ignore_links to skip them.

However, it's not working. The URLs are collected via page.url and are later fed into ignore_links as absolute URL strings. The page I am scraping references its content with relative links.

linkregs = [] # regexes, working fine
ignore = [] # read from file
Spidr.start_at("http://example.com", links: linkregs, ignore_links: ignore) do |spidr|
  spidr.every_page do |page|
    if ignore.include?(page.url.to_s)
      # this is the problem
      puts "Error!!"
    end
    ignore.push(page.url.to_s)
  end
end
# save ignore to file

path conflicts with opaque (URI::InvalidURIError)

I'm trying to crawl Stack Overflow, but the crawler keeps giving me this error. Apparently the problem happens whenever it reaches a particular link. I'm not sure how to fix it, since the offending link contains:

"subject=Stack%20Overflow%20Question&body=Time%20series%20speed%20forecasting%20using%20regression%20with%20exogenous%20variables%0Ahttps%3a%2f%2fstackoverflow.com%2fq%2f49618734%3fsem%3d2"

Traceback (most recent call last):
        21: from main.rb:4:in `<main>'
        20: from /Users/mustafakhalil/Projects/Senior/crawler/crawler.rb:20:in `start_crawling'
        19: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/spidr.rb:53:in `site'
        18: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:274:in `site'
        17: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:355:in `start_at'
        16: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:373:in `run'
        15: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'
        14: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:599:in `get_page'
        13: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
        12: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:605:in `block in get_page'
        11: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
        10: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:238:in `each_url'
         9: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
         8: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:189:in `each'
         7: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:189:in `upto'
         6: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:190:in `block in each'
         5: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
         4: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:182:in `block in each_link'
         3: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
         2: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:283:in `to_absolute'
         1: from /usr/local/Cellar/ruby/2.5.0_2/lib/ruby/2.5.0/uri/generic.rb:822:in `path='
/usr/local/Cellar/ruby/2.5.0_2/lib/ruby/2.5.0/uri/generic.rb:766:in `check_path': path conflicts with opaque (URI::InvalidURIError)

How can I 'ignore everything except' a set of links

In the Examples section, I've got the 'Do not spider certain links' operation to work.

In my case, Spidr.site('http://www.parkers.co.uk/', :ignore_links => [/vans/]) correctly displays all URLs except those starting with /vans/.

However, I really need to also 'ignore' all links EXCEPT for /vans/, so that ONLY /vans/ urls are displayed.

Is this possible?

There are too many possibilities to just add to the 'ignore_links' list; I really need to 'ignore everything except' /vans/.
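One hedged approach, since no built-in 'ignore everything except' option is mentioned here: keep crawling the whole site but only act on matching URLs in the callback. (The whitelist-style links: option used elsewhere on this page is another route, but the start page would also have to pass that filter.)

Spidr.site('http://www.parkers.co.uk/') do |spider|
  spider.every_url do |url|
    # Only display URLs under /vans/; other pages are still crawled, just not printed.
    puts url if url.path.to_s.start_with?('/vans/')
  end
end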

thanks!

Thank you

I'm opening this issue for the sole reason of saying thank you so much for your hard work.

Also, as a side note, I ran the specs against Ruby 2.5.3 using the latest RubyGems 3 and Bundler 2, plus minor updates of the dependencies, and all the specs were green.

Session handling

Is it possible to give Spidr some setup step, such as visiting a page and logging in, before spidering the rest of the site?
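A hedged sketch of one approach, assuming the Agent's get_page and cookie handling work as the stack traces elsewhere on this page suggest (the login URL is hypothetical, and a real login would usually be a POST):

require 'spidr'

agent = Spidr::Agent.new

# Hit the login endpoint first so its Set-Cookie headers land in the agent's
# cookie jar, then register callbacks and start the crawl.
agent.get_page(URI('http://example.com/login?user=me&pass=secret'))

agent.every_html_page { |page| puts page.title }
agent.start_at('http://example.com/')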

Passing Headers to Net::HTTP

In agent.rb:483 it seems like you are passing path and headers to Net::HTTP.get(), but according to the documentation (http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M000655) this method only takes host, path, and port.

I am trying to add the additional client header "Host" but am obviously missing something here in your implementation. Can you clarify?

The use case is having the host be an IP address, but the Host header being the domain name, allowing spidering of sites that aren't in DNS but are hosted through a vhost.
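For reference, a minimal sketch of that use case with plain Net::HTTP (not spidr's API): the instance-level Net::HTTP#get does accept a headers hash, unlike the class-level Net::HTTP.get. The IP address and host name below are placeholders.

require 'net/http'

# Connect by IP, but send a different Host header so a name-based vhost answers.
http = Net::HTTP.new('192.0.2.10', 80)              # hypothetical IP address
response = http.get('/', 'Host' => 'example.com')   # vhost's domain name
puts response.code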

Infinite path loop

Hi,

It seems that when $_SERVER['REQUEST_URI'] or similar is used AND the web server is configured to return custom error pages (including 200 statuses), Spidr ends up in an infinite loop.

In this particular case the problem URL is in a POST form action element, but I don't think it matters where the URL appears.

Eventually ends up with pages like so:

http://www.example.com/dir/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/somefile.js

I'm not sure how this could be solved; the depth option may help cut down on the false-positive URLs, but it wouldn't solve the underlying problem.
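A hedged mitigation sketch, assuming the ignore_links patterns are matched against each discovered link: reject URLs whose path repeats the same segment several times in a row, which is the usual symptom of this kind of loop.

Spidr.site('http://www.example.com/', ignore_links: [%r{(/[^/]+)\1{3,}}]) do |spider|
  spider.every_url { |url| puts url }
end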

Thanks,
Ryan

99% of cpu usage while crawling bigger websites...

Hello, when I'm crawling quite big websites (with :depth => 5), spidr eats all of my server's resources. I'm using it in many threads (within EventMachine), so I think it may be something to do with thread safety. Any ideas?

SSL session reuse may fail

I've just run into a situation where the reuse of an SSL session caused an exception and Spidr subsequently skipped the page. Currently, the exception is silently swallowed, so I modified it to grab the following trace:

EOFError (end of file reached):
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/openssl/buffering.rb:174:in `sysread_nonblock'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/openssl/buffering.rb:174:in `read_nonblock'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `rbuf_fill'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:122:in `readuntil'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:132:in `readline'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:2562:in `read_status_line'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:2551:in `read_new'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1319:in `block in transport_request'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1316:in `catch'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1316:in `transport_request'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1293:in `request'
  rest-client (1.6.7) lib/restclient/net_http_ext.rb:51:in `request'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1026:in `get'
  spidr (0.4.1) lib/spidr/agent.rb:513:in `block in get_page'
  spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
  spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
  app/models/cookie_login_option.rb:150:in `fetch_remote_form'
  app/models/cookie_login_option.rb:158:in `block in fetch_remote_form'
  spidr (0.4.1) lib/spidr/agent.rb:518:in `block in get_page'
  spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
  spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
  app/models/cookie_login_option.rb:150:in `fetch_remote_form'
  app/models/cookie_login_option.rb:158:in `block in fetch_remote_form'
  spidr (0.4.1) lib/spidr/agent.rb:518:in `block in get_page'
  spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
  spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'

If I modify the code to remove the session cache, I am able to fetch the page okay. It might be good to catch EOFError and retry with a new session in the event this happens. Catching the error all over the place could be messy though.
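A hedged sketch of the retry idea, kept outside spidr and assuming the EOFError actually propagates (stock spidr swallows it); the sessions/kill! accessor is an assumption about the session cache, hence the guard:

require 'spidr'

# Fetch a page, retrying once on EOFError after dropping the cached session.
def fetch_with_retry(agent, url, attempts = 2)
  begin
    agent.get_page(url)
  rescue EOFError
    attempts -= 1
    raise if attempts <= 0
    if agent.respond_to?(:sessions) && agent.sessions.respond_to?(:kill!)
      agent.sessions.kill!(url)  # assumed SessionCache helper
    end
    retry
  end
end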

unable to ignore links

cool gem!

I'm trying to ignore /partners and everything after it on my site (www.mysite.com/partners/resellers), but spidr is still visiting those links.

root = args[:url]

url_map = Hash.new { |hash,key| hash[key] = [] }

spider = Spidr.site(root, ignore_links_like: [%{^/partners/}]) do |spider|
  spider.every_url { |url| puts(url) }
  spider.every_failed_url { |url| puts "Failed url #{url}" }
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"

  url_map[url].each { |page| puts ("  #{page}").red }
end
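A hedged sketch of what usually works here: as far as I can tell, the constructor option is ignore_links (ignore_links_like appears to be an instance method), and it wants a real Regexp rather than a %{} literal, which is just a String:

spider = Spidr.site(root, ignore_links: [%r{/partners/}]) do |spider|
  spider.every_url { |url| puts url }
end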

#<NoMethodError: undefined method `closed?' for nil:NilClass>

Hi.

I got the following error while spidering a site. I suspect it was because the remote site dropped the connection, however I am unsure.

<NoMethodError: undefined method `closed?' for nil:NilClass>

/usr/lib/ruby/1.8/net/http.rb:1060:in `request'
/usr/lib/ruby/1.8/net/http.rb:772:in `get'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:521:in `get_page'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:693:in `prepare_request'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:520:in `get_page'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:586:in `visit_page'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:256:in `run'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:238:in `start_at'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:209:in `site'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:136:in `initialize'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:206:in `new'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:206:in `site'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/spidr.rb:96:in `site'

I was thinking that a possible solution would be to wrap the agent.rb get_page method contents in a begin/rescue block, report the error, but carry on, or maybe try again?

def get_page(url)
  url = URI(url.to_s)

  begin
    prepare_request(url) do |session,path,headers|
      new_page = Page.new(url,session.get(path,headers))

      # save any new cookies
      @cookies.from_page(new_page)

      yield new_page if block_given?
      return new_page
    end
  rescue => e
    puts '+++++ ERROR IN SPIDR GEM ' + e.inspect
    return ''
  end
end

SSL and network, in general

A couple of things

  1. From everything I can tell about net/https, you need to set 'use_ssl' to true when you want things to run over SSL. You'd also want to turn off certificate validation, since we are spidering and really don't care about validity, just connectivity. Ex:
    http = Net::HTTP.new(host,port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  2. You really, really, really need to re-use your HTTP session objects once they are created. Right now, Agent will create and destroy a full connection for each request. When SSL is involved, the overhead of negotiating an SSL connection will kill a CPU.
    You can easily share your sessions if you create a hash that is keyed off of [host, port]. If you run into an error while making an HTTP request, destroy the session and recreate it. You'll be able to run hundreds, if not thousands, of HTTP requests over a single TCP connection (see the sketch after this list).
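A minimal sketch of the session-reuse idea from point 2, using plain Net::HTTP (this is not spidr's SessionCache): one connection per [host, port, ssl] key, handed back on subsequent lookups, with a helper to drop a broken connection.

require 'net/http'
require 'openssl'

class SimpleSessionCache
  def initialize
    @sessions = {}
  end

  # Return a started Net::HTTP connection for the URI, creating it on first use.
  def [](uri)
    key = [uri.host, uri.port, uri.scheme == 'https']

    @sessions[key] ||= begin
      http = Net::HTTP.new(uri.host, uri.port)
      if uri.scheme == 'https'
        http.use_ssl     = true
        http.verify_mode = OpenSSL::SSL::VERIFY_NONE  # spidering: skip cert checks
      end
      http.start
      http
    end
  end

  # Drop a broken connection so the next request re-opens it.
  def kill!(uri)
    key = [uri.host, uri.port, uri.scheme == 'https']
    if (http = @sessions.delete(key))
      http.finish rescue nil
    end
  end
end

# Usage: cache = SimpleSessionCache.new
#        cache[URI('https://example.com/')].get('/')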

Odd Net::HTTP error when requesting the last page.

Noticed this exception from within Net::HTTP when requesting the last page:

NoMethodError: undefined method `closed?' for nil:NilClass
    from /usr/lib64/ruby/1.8/net/http.rb:1060:in `request'
    from /usr/lib64/ruby/1.8/net/http.rb:772:in `get'
    from ./lib/spidr/agent.rb:501
    from ./lib/spidr/agent.rb:671:in `call'
    from ./lib/spidr/agent.rb:671:in `prepare_request'
    from ./lib/spidr/agent.rb:500:in `get_page'
    from ./lib/spidr/agent.rb:566:in `visit_page'
    from ./lib/spidr/agent.rb:244:in `run'
    from ./lib/spidr/agent.rb:226:in `start_at'
    from ./lib/spidr/agent.rb:171
    from ./lib/spidr/agent.rb:124:in `call'
    from ./lib/spidr/agent.rb:124:in `initialize'
    from ./lib/spidr/agent.rb:168:in `new'
    from ./lib/spidr/agent.rb:168:in `host'
    from ./lib/spidr/spidr.rb:89:in `host'
    from (irb):2

Multiple cookies from same page

Hi,
This is probably more of a feature request than a bug.

If a page sets more than one cookie:

<?php

if (!isset($_COOKIE["TestCookie"])) {
  setcookie("TestCookie", "value");
}

if (!isset($_COOKIE["TestCookie1"])) {
  setcookie("TestCookie1", "value");
}

echo $_COOKIE["TestCookie"];
echo $_COOKIE["TestCookie1"];

?>

Spidr returns:

puts page.cookies.to_s # => TestCookie=valueTestCookie1=value

There is no easy way to distinguish between TestCookie and TestCookie1. It would be better if page.cookies output 'TestCookie=value;TestCookie1=value'; that way each cookie could be identified with a simple split.

Is there any way I can implement this fix now for my own use?
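A hedged workaround sketch, assuming the page object exposes the raw Net::HTTP response: Net::HTTP keeps each Set-Cookie header as a separate field, so they can be read individually instead of via the joined string.

Spidr.site('http://example.com/') do |spider|
  spider.every_page do |page|
    next unless page.respond_to?(:response)

    raw_cookies = page.response.get_fields('Set-Cookie') || []
    raw_cookies.each { |cookie| puts cookie }  # e.g. "TestCookie=value", "TestCookie1=value"
  end
end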

Ryan

Redirect fails silently - need debug messages

Howdy. We just had a big debugging session centered around redirects, and it turned out that the site was redirecting from the non-www to the www.domain URL, so spidr silently stopped: it found that the target was a different host, which is true, but not intuitive.

May we suggest an option to pass in a logger and log level? Then you could log "page (url) prevented by host filter". That would be super duper nice.

uninitialized constant Spidr::Headers::Set

The following failed when just trying to require 'spidr'

ruby-1.8.7-p330@spidr/gems/spidr-0.3.0/lib/spidr/headers.rb:4: uninitialized constant Spidr::Headers::Set

I am using rvm, with ruby 1.8.7-p330

It also failed for ruby 1.9.2.

fetch_titles not following 301

I'm using the following code to fetch titles on a site

def fetch_titles site
  Enumerator.new do |enum|
    Spidr.site(site) do |spider|
      spider.every_html_page do |page|
        enum.yield page.title
      end
    end
  end
end


fetch_titles('http://site.tld').each do |site|
  p site
end

I'm getting a lot of "301 Moved Permanently" for page.title because Spidr is requesting http://site.tld/~page instead of http://site.tld/~page/.

Is there any way to tell Spidr to append a / to the URI, or to follow 301s automatically?

undefined method `attr' for #<Nokogiri::XML::Element:0xb77423c4> (NoMethodError)

ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
Spidr 0.2.5

I kept getting this error;

/usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:537:in `meta_redirect': undefined method `attr' for #<Nokogiri::XML::Element:0xb77423c4> (NoMethodError)

The error is caused by the meta_redirect function using the wrong Nokogiri syntax.

Line 537:
node.attr('http-equiv')
replace with:
node.attributes['http-equiv']

Line 538:
node.attr('content')
replace with:
node.attributes['content']

Ryan

Spidering pages with no content-type header

We ran into a scenario where we tried to spider a customer's site for certain keywords -- keywords that were present when we viewed the site in a browser -- but could not locate any of them using some flavor of page.search('//body').text.include?(keyword). Ultimately, page.search('//body') returned an empty array because this customer's web server is not returning a Content-Type header, so the body is never parsed into HTML or XML.

What are your thoughts on attempting to parse pages which have no content-type header as HTML? This matches the behavior of current web browsers, and at first glance makes this spider more intuitive to use. I'm happy to work on it, but I may be missing a compelling reason to simply ignore such pages.
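For what it's worth, a hedged workaround sketch until that behavior changes: when a response carries no Content-Type, parse the raw body as HTML yourself instead of relying on page.doc / page.search (the site URL and keyword are placeholders).

require 'nokogiri'
require 'spidr'

keyword = 'refund policy'  # hypothetical keyword

Spidr.site('http://example.com/') do |spider|
  spider.every_page do |page|
    # Fall back to parsing the body as HTML when spidr didn't build a document.
    doc = page.doc || Nokogiri::HTML(page.body)
    puts page.url if doc.search('//body').text.include?(keyword)
  end
end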

Malformed mailto error

The Spidr gem seems to not handle malformed mailto addresses properly. I can't see this problem arising very often, but it did in my case.

  • Here is the HTML that is causing the error (notice the commas instead of dots):

<a href="mailto:user@example,org,uk">[email protected]</a>

  • The error:

/usr/lib/ruby/1.8/uri/generic.rb:732:in `merge': unrecognised opaque part for mailtoURL: user@example,org,uk (URI::InvalidComponentError)
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:509:in `to_absolute'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:495:in `urls'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:495:in `map'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:495:in `urls'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:587:in `visit_page'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:513:in `get_page'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:678:in `prepare_request'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:507:in `get_page'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:573:in `visit_page'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:244:in `run'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:226:in `start_at'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:197:in `site'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:124:in `initialize'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:194:in `new'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:194:in `site'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/spidr.rb:96:in `site'

Improve network connection to HTTPS server via HTTPSProxy.

In this case, every GET request produces a CONNECT request plus a GET request, which is very slow. If you add a "session.start" in the method SessionCache#[], around line 89, you will have only one CONNECT, issued when the session is first created.

So in the file lib/spidr/session_cache.rb you would have:

 84   if url.scheme == 'https'
 85     session.use_ssl     = true
 86     session.verify_mode = OpenSSL::SSL::VERIFY_NONE
 87   end
 88
+89   session.start
 90   @sessions[key] = session
 91 end

     return @sessions[key]
   end

Regards

Limit crawl to links matching pattern

Hi,

I want to crawl a website and reduce the crawl time, so I'm trying to limit the pages crawled to only those I really need. To do so, I'd like to implement 2 rules:

  1. Limit to 10 pages max
  2. Limit to links where anchor text match a regex

Is it possible with Spidr?
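A hedged sketch for rule 1: a limit: option appears in use elsewhere on this page to cap the number of visited pages. Rule 2 (matching on anchor text) doesn't seem to be exposed by the URL-based filters, so it would likely need custom handling, e.g. inspecting each page's DOM yourself. The site URL below is a placeholder.

Spidr.site('http://example.com/', limit: 10) do |spider|
  spider.every_html_page do |page|
    puts page.title
  end
end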

Thanks a lot,
Brice

static course specs

I'm working on a patch, and trying to augment the course-related specs with some other real-world examples. However, when I copy the static/course directory to a public webserver and change the COURSE_URL in spec/helpers/course.rb to reflect that, I get a number of spec failures (16 to be exact).

Is this expected? Are there particular rewrite rules or something that aren't codified in the static directory I'd need in order to replicate this?
