
spidr's People

Contributors

buren, davidsauntson, joshcheek, justfalter, kyaroch, maccman, nu7hatch, petergoldstein, postmodern, rfletcher, spk, tricknotes, zapnap


spidr's Issues

Multithreading

Hi there,

I was wondering if it would be possible to multithread the spidr gem? I don't know much about multithreading in Ruby, but I believe only Ruby 1.9.x is able to do so?

I had a look through the source but couldn't find where the spidr gem makes its HTTP requests.

Maybe something like Typhoeus could be used? (http://rubygems.org/gems/typhoeus)
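For what it's worth, a minimal sketch of one workaround, assuming you only need concurrency across sites: spidr itself crawls sequentially (the HTTP calls go through Net::HTTP inside Agent#get_page, per the stack traces elsewhere on this page), but independent agents can be run in ordinary Ruby threads. The site list here is hypothetical.

require 'spidr'

# Hypothetical list of sites; one independent agent per thread.
sites = %w[http://example.com http://example.org]

threads = sites.map do |site|
  Thread.new do
    Spidr.site(site) do |spider|
      spider.every_url { |url| puts "#{Thread.current.object_id}: #{url}" }
    end
  end
end

threads.each(&:join)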

Thanks,
Ryan

catching SSLErrors

First off, thanks for the great work on spidr! Second, I noticed that spidr kinda dies on sites that completely fail at SSL. For instance, https://36pizza.com causes spidr to crash with an SSL error. This can be resolved by adding OpenSSL::SSL::SSLError to the rescue clause on lib/agent.rb:684. Right now I don't have time to fork and make a pull request, so I'm monkeypatching, but I thought you should know.
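A hedged sketch of such a monkeypatch, wrapping Agent#prepare_request from the outside rather than editing spidr's rescue clause directly (Agent#failed is assumed to exist for recording the failure, hence the guard):

require 'spidr'
require 'openssl'

module Spidr
  class Agent
    alias_method :prepare_request_without_ssl_rescue, :prepare_request

    # Treat an SSL failure like any other failed request instead of crashing.
    def prepare_request(url, &block)
      prepare_request_without_ssl_rescue(url, &block)
    rescue OpenSSL::SSL::SSLError
      failed(url) if respond_to?(:failed, true)  # assumed helper
      nil
    end
  end
end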

Crawling a specific page

Hey! I was wondering if there was a way to return all links found on a specific page. So far spidr has been great for crawling a whole site, but for my testing I'd like to be able to focus on one page.
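A hedged sketch of one way to do this, assuming Agent#get_page and the page URL enumerators (each_url / urls, seen in the stack traces elsewhere on this page) behave as expected; the page URL below is a placeholder:

require 'spidr'

# Fetch a single page and list the links it contains, without crawling further.
agent = Spidr::Agent.new
page  = agent.get_page(URI('http://example.com/some-page'))

page.each_url { |url| puts url } if page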

Thanks

Following redirects

Howdy! Just wondering if I'm implementing this right. I need to follow redirects, and there doesn't seem to be an option to toggle this, so I tried implementing it this way. It seems to work, but I would like some feedback!

Spidr.site(@url, max_depth: 2, limit: 20) do |spider|
  spider.every_redirect_page do |page|
    # allow the redirect target's host, then queue the target URL itself
    spider.visit_hosts << URI.parse(page.location).host
    spider.enqueue page.location
  end
end

Problem building gem

When trying to build the gem for the edge version of this library I get the following message:

dwilliams@lists:~/src/spidr$ rake gem
(in /home/dwilliams/src/spidr)
rake aborted!
undefined method `yard_opts' for #<Hoe:0xb78fb5e4>
/home/dwilliams/src/spidr/Rakefile:14
(See full trace by running task with --trace)
dwilliams@lists:~/src/spidr$ rake gem
(in /home/dwilliams/src/spidr)
rake aborted!
undefined method `yard_opts' for #<Hoe:0xb78c0430>
/home/dwilliams/src/spidr/Rakefile:14
(See full trace by running task with --trace)

and the trace is this:

dwilliams@lists:~/src/spidr$ rake --trace gem
(in /home/dwilliams/src/spidr)
rake aborted!
undefined method `yard_opts' for #<Hoe:0xb79330ac>
/home/dwilliams/src/spidr/Rakefile:14
/usr/lib/ruby/gems/1.8/gems/hoe-2.5.0/lib/hoe.rb:292:in `instance_eval'
/usr/lib/ruby/gems/1.8/gems/hoe-2.5.0/lib/hoe.rb:292:in `spec'
/home/dwilliams/src/spidr/Rakefile:9
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2383:in `load'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2383:in `raw_load_rakefile'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2017:in `load_rakefile'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2068:in `standard_exception_handling'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2016:in `load_rakefile'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2000:in `run'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2068:in `standard_exception_handling'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:1998:in `run'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/bin/rake:31
/usr/bin/rake:19:in `load'
/usr/bin/rake:19

What obvious thing am I missing this time? I have noticed that I get this issue with a few of your other libraries as well.

expected absolute path component: sites/ftp.apache.org/

Get this error while spidering http://apache.org.

The URL that breaks the spider seems to be this one:
www.mirrorservice.org/sites/ftp.apache.org/ (it looks as though there's a domain in there, when in fact it's a path)

/usr/lib/ruby/1.8/uri/generic.rb:475:in `check_path': bad component(expected absolute path component): sites/ftp.apache.org/ (URI::InvalidComponentError)
    from /usr/lib/ruby/1.8/uri/generic.rb:495:in `path='
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:537:in `to_absolute'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `urls'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `map'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `urls'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:587:in `visit_page'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:513:in `get_page'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:678:in `prepare_request'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:507:in `get_page'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:573:in `visit_page'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:244:in `run'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:226:in `start_at'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:197:in `site'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:124:in `initialize'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:194:in `new'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:194:in `site'
    from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/spidr.rb:96:in `site'

I will do some further investigating and see if I can come up with a fix.

ryan

URL Normalization.

Hey postmodern,

It seems like the usage of File.expand_path in Spidr::Page#to_absolute can goof up URLs in a very minor way. Observe:

irb(main):001:0> a = '/somedir/'
=> "/somedir/"
irb(main):002:0> File.expand_path(a)
=> "/somedir"

Imagine that you go to a site 'http://www.foo.com/somedir' ... '/somedir' is a directory, and the server responds with:

HTTP/1.1 301 Moved Permanently
Date: Mon, 21 Sep 2009 23:08:19 GMT
Server: Apache/2.0.63 (CentOS)
Location: http://www.foo.com/somedir/
.....

Requesting 'http://www.foo.com/somedir/' yields
HTTP/1.1 200 OK
Date: Mon, 21 Sep 2009 23:10:32 GMT
Server: Apache/2.0.63 (CentOS)
.....

When to_absolute normalizes 'http://www.foo.com/somedir/', it comes out of the method as 'http://www.foo.com/somedir', which has already been visited.

In the real world, 'http://www.foo.com/somedir' != 'http://www.foo.com/somedir/'. File.expand_path doesn't know the difference between the two, but to an HTTP server they are two different things.
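A minimal sketch of a trailing-slash-preserving normalization, assuming the goal is only to keep the distinction that File.expand_path throws away (this is not spidr's actual code):

# Normalize a URL path but keep a trailing slash if the input had one.
def normalize_path(path)
  normalized = File.expand_path(path)
  normalized << '/' if path.end_with?('/') && !normalized.end_with?('/')
  normalized
end

normalize_path('/somedir/')  # => "/somedir/"
normalize_path('/somedir')   # => "/somedir"
normalize_path('/a/../b/')   # => "/b/"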

~Mike

Spidr behaves oddly on bad host-names under Ruby 1.8.7-p249

Just noticed unusual exceptions coming from Net::HTTP when running Spidr on the WSOC under Ruby 1.8.7-p249.

NoMethodError in 'Spidr::Agent before(:all)'
undefined method `closed?' for nil:NilClass
/usr/lib64/ruby/1.8/net/http.rb:1060:in `request'
/usr/lib64/ruby/1.8/net/http.rb:772:in `get'
/underground/code/spidr/spec/helpers/wsoc.rb:69:in `run_course'
./spec/agent_spec.rb:10:

Link depth?

Hi,

I am sure I saw somewhere that 'start_at' accepted a 'depth' parameter, but I can't seem to find any reference to it any longer. Either way, if I wasn't just imagining things, the depth option doesn't seem to be working. I grepped through the code and couldn't find it either.

Here is a pastie of a test spider of my blog:
http://pastie.org/1754398

Is there a depth option? If so, am I using it wrong, or is it not working?
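For what it's worth, a hedged sketch: newer spidr versions accept a :max_depth option (it is used that way elsewhere on this page); older releases may simply not have it, which would explain the grep coming up empty.

Spidr.start_at('http://example.com/', max_depth: 2) do |spider|
  spider.every_url { |url| puts url }
end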

Thanks a million for any help. :)

Page#to_absolute raises URI::InvalidURIError: path conflicts with opaque

Thanks for an awesome gem!

When crawling a site, this exception was raised:

URI::InvalidURIError: path conflicts with opaque
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:761:in `check_path'
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:817:in `path='
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:283:in `to_absolute'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:182:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `upto'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `each'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:238:in `each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:605:in `block in get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:599:in `get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'

The line in spidr that raises this error is lib/spidr/page/html.rb:283:

new_url.path = URI.expand_path(path)

URI#path= calls URI#check_path, which raises the error; see the Ruby docs for URI::Generic#check_path.

I'm not really sure what the best way to handle this would be. Perhaps catching URI::InvalidURIError and returning nil is sensible, since nil can already be returned from Page#to_absolute?
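A hedged sketch of that suggestion, written as a standalone helper rather than an actual spidr patch (safe_absolute is a hypothetical name; URI.expand_path is the spidr extension referenced above):

require 'spidr'

# Resolve a link against a base URI, guarding the path normalization and
# falling back to nil, which callers of Page#to_absolute already tolerate.
def safe_absolute(base, link)
  new_url = base.merge(link.to_s)

  if (path = new_url.path) && !path.empty?
    new_url.path = URI.expand_path(path)
  end

  new_url
rescue URI::InvalidURIError, URI::InvalidComponentError
  nil
end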

Skip processing of pages

The documentation says that it is possible to skip processing some pages, but I cannot find how to do it. I have tried ignore_links and ignore_pages, but nothing seems to work, e.g.:

spider = Spidr.site('.....', ignore_links: [%{^/blog/}]) do |spider|
  spider.every_html_page do |page|
    # here I still get pages with the /blog URL
  end
end

How can I ignore some pages based on the URL?
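A hedged sketch of what usually works: pass a real Regexp rather than a %{} literal (which is just a String), and avoid anchoring it to the start of the string in case the pattern is matched against the full URL rather than the path. The site URL below is a stand-in.

Spidr.site('http://example.com/', ignore_links: [%r{/blog/}]) do |spider|
  spider.every_html_page do |page|
    puts page.url   # /blog/... pages should no longer show up here
  end
end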

`ignore_links` not working.

Hello. I am loving this library! But I have an issue.

I am collecting the URLs of already-scraped pages in an array, and for later continuation of the process I am using ignore_links to skip them.

However, it's not working. The URLs are collected via page.url and are later fed into ignore_links as absolute URL strings. The page I am scraping references its content with relative links.

linkregs = [] # regexes, working fine
ignore = [] # read from file
Spidr.start_at("http://example.com", links: linkregs, ignore_links: ignore) do |spidr|
  spidr.every_page do |page|
    if ignore.include?(page.url.to_s)
      # this is the problem
      puts "Error!!"
    end
    ignore.push(page.url.to_s)
  end
end
# save ignore to file

path conflicts with opaque (URI::InvalidURIError)

I'm trying to crawl Stack Overflow, but the crawler keeps giving me this error. Apparently the problem happens whenever it reaches a particular link. I'm not sure how to fix it, since the offending link contains:

"subject=Stack%20Overflow%20Question&body=Time%20series%20speed%20forecasting%20using%20regression%20with%20exogenous%20variables%0Ahttps%3a%2f%2fstackoverflow.com%2fq%2f49618734%3fsem%3d2"

Traceback (most recent call last):
        21: from main.rb:4:in `<main>'
        20: from /Users/mustafakhalil/Projects/Senior/crawler/crawler.rb:20:in `start_crawling'
        19: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/spidr.rb:53:in `site'
        18: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:274:in `site'
        17: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:355:in `start_at'
        16: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:373:in `run'
        15: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'
        14: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:599:in `get_page'
        13: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
        12: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:605:in `block in get_page'
        11: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
        10: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:238:in `each_url'
         9: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
         8: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:189:in `each'
         7: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:189:in `upto'
         6: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:190:in `block in each'
         5: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
         4: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:182:in `block in each_link'
         3: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
         2: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:283:in `to_absolute'
         1: from /usr/local/Cellar/ruby/2.5.0_2/lib/ruby/2.5.0/uri/generic.rb:822:in `path='
/usr/local/Cellar/ruby/2.5.0_2/lib/ruby/2.5.0/uri/generic.rb:766:in `check_path': path conflicts with opaque (URI::InvalidURIError)

How can I 'ignore everything except' a set of links

In the Examples section, I've got the 'Do not spider certain links' operation to work.

In my case, Spidr.site('http://www.parkers.co.uk/', :ignore_links => [/vans/]) correctly displays all URLs except those starting with /vans/.

However, I really need to also 'ignore' all links EXCEPT for /vans/, so that ONLY /vans/ urls are displayed.

Is this possible?

There are too many possibilities to just add to the 'ignore_links' list; I really need to 'ignore everything except' /vans/.
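One hedged approach, since no built-in 'ignore everything except' option is mentioned here: keep crawling the whole site but only act on matching URLs in the callback. (The whitelist-style links: option used elsewhere on this page is another route, but the start page would also have to pass that filter.)

Spidr.site('http://www.parkers.co.uk/') do |spider|
  spider.every_url do |url|
    # Only display URLs under /vans/; other pages are still crawled, just not printed.
    puts url if url.path.to_s.start_with?('/vans/')
  end
end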

thanks!

Thank you

I'm opening this issue for the sole reason of saying thank you so much for your hard work.

Also, as a side note, I ran the specs against Ruby 2.5.3 using the latest RubyGems 3 and Bundler 2, plus minor updates of the dependencies, and all the specs were green.

Session handling

Is it possible to give Spidr some setup step, such as visiting a page and logging in, before spidering the rest of the site?
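A hedged sketch of one approach, assuming the Agent's get_page and cookie handling work as the stack traces elsewhere on this page suggest (the login URL is hypothetical, and a real login would usually be a POST):

require 'spidr'

agent = Spidr::Agent.new

# Hit the login endpoint first so its Set-Cookie headers land in the agent's
# cookie jar, then register callbacks and start the crawl.
agent.get_page(URI('http://example.com/login?user=me&pass=secret'))

agent.every_html_page { |page| puts page.title }
agent.start_at('http://example.com/')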

Passing Headers to Net::HTTP

In agent.rb:483 it seems like you are passing path and headers to Net::HTTP.get(), but according to the documentation (http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M000655) this method only takes host, path, and port.

I am trying to add the additional client header "Host" but am obviously missing something here in your implementation. Can you clarify?

The use case is having the host be an IP address, but the Host header being the domain name, allowing spidering of sites that aren't in DNS but are hosted through a vhost.
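For reference, a minimal sketch of that use case with plain Net::HTTP (not spidr's API): the instance-level Net::HTTP#get does accept a headers hash, unlike the class-level Net::HTTP.get. The IP address and host name below are placeholders.

require 'net/http'

# Connect by IP, but send a different Host header so a name-based vhost answers.
http = Net::HTTP.new('192.0.2.10', 80)              # hypothetical IP address
response = http.get('/', 'Host' => 'example.com')   # vhost's domain name
puts response.code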

Infinite path loop

Hi,

It seems that when $_SERVER['REQUEST_URI'] or similar is used AND the web server is configured to return custom error pages (including 200 statuses), Spidr ends up in an infinite loop.

In this particular case the problem URL is in a POST form action element, but I don't think it matters where the URL appears.

Eventually ends up with pages like so:

http://www.example.com/dir/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/js/somefile.js

I'm not sure how this could be solved; the depth option may help cut down on the false-positive URLs, but it wouldn't solve the underlying problem.
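A hedged mitigation sketch, assuming the ignore_links patterns are matched against each discovered link: reject URLs whose path repeats the same segment several times in a row, which is the usual symptom of this kind of loop.

Spidr.site('http://www.example.com/', ignore_links: [%r{(/[^/]+)\1{3,}}]) do |spider|
  spider.every_url { |url| puts url }
end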

Thanks,
Ryan

99% of cpu usage while crawling bigger websites...

Hello, when I'm crawling quite big websites (with :depth => 5), spidr eats all of my server's resources. I'm using it in many threads (within EventMachine), so I think it may be something to do with thread safety. Any ideas?

SSL session reuse may fail

I've just run into a situation where the reuse of an SSL session caused an exception and Spidr subsequently skipped the page. Currently, the exception is silently swallowed, so I modified it to grab the following trace:

EOFError (end of file reached):
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/openssl/buffering.rb:174:in `sysread_nonblock'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/openssl/buffering.rb:174:in `read_nonblock'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `rbuf_fill'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:122:in `readuntil'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:132:in `readline'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:2562:in `read_status_line'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:2551:in `read_new'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1319:in `block in transport_request'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1316:in `catch'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1316:in `transport_request'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1293:in `request'
  rest-client (1.6.7) lib/restclient/net_http_ext.rb:51:in `request'
  /home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1026:in `get'
  spidr (0.4.1) lib/spidr/agent.rb:513:in `block in get_page'
  spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
  spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
  app/models/cookie_login_option.rb:150:in `fetch_remote_form'
  app/models/cookie_login_option.rb:158:in `block in fetch_remote_form'
  spidr (0.4.1) lib/spidr/agent.rb:518:in `block in get_page'
  spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
  spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
  app/models/cookie_login_option.rb:150:in `fetch_remote_form'
  app/models/cookie_login_option.rb:158:in `block in fetch_remote_form'
  spidr (0.4.1) lib/spidr/agent.rb:518:in `block in get_page'
  spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
  spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'

If I modify the code to remove the session cache, I am able to fetch the page okay. It might be good to catch EOFError and retry with a new session in the event this happens. Catching the error all over the place could be messy though.
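A hedged sketch of the retry idea, kept outside spidr and assuming the EOFError actually propagates (stock spidr swallows it); the sessions/kill! accessor is an assumption about the session cache, hence the guard:

require 'spidr'

# Fetch a page, retrying once on EOFError after dropping the cached session.
def fetch_with_retry(agent, url, attempts = 2)
  begin
    agent.get_page(url)
  rescue EOFError
    attempts -= 1
    raise if attempts <= 0
    if agent.respond_to?(:sessions) && agent.sessions.respond_to?(:kill!)
      agent.sessions.kill!(url)  # assumed SessionCache helper
    end
    retry
  end
end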

unable to ignore links

cool gem!

I'm trying to ignore /partners and everything after it on my site (www.mysite.com/partners/resellers), but spidr is still visiting those links.

root = args[:url]

url_map = Hash.new { |hash,key| hash[key] = [] }

spider = Spidr.site(root, ignore_links_like: [%{^/partners/}]) do |spider|
  spider.every_url { |url| puts(url) }
  spider.every_failed_url { |url| puts "Failed url #{url}" }
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"

  url_map[url].each { |page| puts ("  #{page}").red }
end
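A hedged sketch of what usually works here: as far as I can tell, the constructor option is ignore_links (ignore_links_like appears to be an instance method), and it wants a real Regexp rather than a %{} literal, which is just a String:

spider = Spidr.site(root, ignore_links: [%r{/partners/}]) do |spider|
  spider.every_url { |url| puts url }
end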

#<NoMethodError: undefined method `closed?' for nil:NilClass>

Hi.

I got the following error while spidering a site. I suspect it was because the remote site dropped the connection, however I am unsure.

<NoMethodError: undefined method `closed?' for nil:NilClass>

/usr/lib/ruby/1.8/net/http.rb:1060:in `request'
/usr/lib/ruby/1.8/net/http.rb:772:in `get'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:521:in `get_page'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:693:in `prepare_request'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:520:in `get_page'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:586:in `visit_page'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:256:in `run'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:238:in `start_at'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:209:in `site'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:136:in `initialize'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:206:in `new'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:206:in `site'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/spidr.rb:96:in `site'

I was thinking that a possible solution would be to wrap the agent.rb get_page method contents in a begin/rescue block, report the error, but carry on, or maybe try again?

def get_page(url)
  url = URI(url.to_s)

  begin
    prepare_request(url) do |session,path,headers|
      new_page = Page.new(url,session.get(path,headers))

      # save any new cookies
      @cookies.from_page(new_page)

      yield new_page if block_given?
      return new_page
    end
  rescue => e
    puts '+++++ ERROR IN SPIDR GEM ' + e.inspect
    return ''
  end
end

SSL and network, in general

A couple of things

  1. From everything I can tell about net/https, you need to set 'use_ssl' to true when you want things to run over SSL. You'd also want to turn off certificate validation, since we are spidering and really don't care about validity, just connectivity. Ex:
    http = Net::HTTP.new(host,port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  2. You really, really, really need to re-use your HTTP session objects once they are created. Right now, Agent will create and destroy a full connection for each request. When SSL is involved, the overhead of negotiating an SSL connection will kill a CPU.
    You can easily share your sessions if you create a hash that is keyed off of [host, port]. If you run into an error while making an HTTP request, destroy the session and recreate it. You'll be able to run hundreds, if not thousands, of HTTP requests over a single TCP connection (see the sketch after this list).
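A minimal sketch of the session-reuse idea from point 2, using plain Net::HTTP (this is not spidr's SessionCache): one connection per [host, port, ssl] key, handed back on subsequent lookups, with a helper to drop a broken connection.

require 'net/http'
require 'openssl'

class SimpleSessionCache
  def initialize
    @sessions = {}
  end

  # Return a started Net::HTTP connection for the URI, creating it on first use.
  def [](uri)
    key = [uri.host, uri.port, uri.scheme == 'https']

    @sessions[key] ||= begin
      http = Net::HTTP.new(uri.host, uri.port)
      if uri.scheme == 'https'
        http.use_ssl     = true
        http.verify_mode = OpenSSL::SSL::VERIFY_NONE  # spidering: skip cert checks
      end
      http.start
      http
    end
  end

  # Drop a broken connection so the next request re-opens it.
  def kill!(uri)
    key = [uri.host, uri.port, uri.scheme == 'https']
    if (http = @sessions.delete(key))
      http.finish rescue nil
    end
  end
end

# Usage: cache = SimpleSessionCache.new
#        cache[URI('https://example.com/')].get('/')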

Odd Net::HTTP error when requesting the last page.

Noticed this exception from within Net::HTTP when requesting the last page:

NoMethodError: undefined method `closed?' for nil:NilClass
    from /usr/lib64/ruby/1.8/net/http.rb:1060:in `request'
    from /usr/lib64/ruby/1.8/net/http.rb:772:in `get'
    from ./lib/spidr/agent.rb:501
    from ./lib/spidr/agent.rb:671:in `call'
    from ./lib/spidr/agent.rb:671:in `prepare_request'
    from ./lib/spidr/agent.rb:500:in `get_page'
    from ./lib/spidr/agent.rb:566:in `visit_page'
    from ./lib/spidr/agent.rb:244:in `run'
    from ./lib/spidr/agent.rb:226:in `start_at'
    from ./lib/spidr/agent.rb:171
    from ./lib/spidr/agent.rb:124:in `call'
    from ./lib/spidr/agent.rb:124:in `initialize'
    from ./lib/spidr/agent.rb:168:in `new'
    from ./lib/spidr/agent.rb:168:in `host'
    from ./lib/spidr/spidr.rb:89:in `host'
    from (irb):2

Multiple cookies from same page

Hi,
This is probably more of a feature request than a bug.

If a page sets more than one cookie:

<?php

if (!isset($_COOKIE["TestCookie"])) {
  setcookie("TestCookie", "value");
}

if (!isset($_COOKIE["TestCookie1"])) {
  setcookie("TestCookie1", "value");
}

echo $_COOKIE["TestCookie"];
echo $_COOKIE["TestCookie1"];

?>

Spidr returns:

puts page.cookies.to_s # => TestCookie=valueTestCookie1=value

There is no easy way to distinguish between TestCookie and TestCookie1. It would be better if page.cookies output 'TestCookie=value;TestCookie1=value'; that way each cookie could be identified with a simple split.

Is there any way I can implement this fix now for my own use?
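A hedged workaround sketch, assuming the page object exposes the raw Net::HTTP response: Net::HTTP keeps each Set-Cookie header as a separate field, so they can be read individually instead of via the joined string.

Spidr.site('http://example.com/') do |spider|
  spider.every_page do |page|
    next unless page.respond_to?(:response)

    raw_cookies = page.response.get_fields('Set-Cookie') || []
    raw_cookies.each { |cookie| puts cookie }  # e.g. "TestCookie=value", "TestCookie1=value"
  end
end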

Ryan

Redirect fails silently - need debug messages

Howdy. We just had a big debugging session centered around redirects, and it turned out that the site was redirecting from the non-www to the www.domain URL, so spidr silently stopped: it found that the target was a different host, which is true, but not intuitive.

May we suggest an option to pass in a logger and log level? Then you could log "page (url) prevented by host filter". That would be super duper nice.

uninitialized constant Spidr::Headers::Set

The following failed when just trying to require 'spidr'

ruby-1.8.7-p330@spidr/gems/spidr-0.3.0/lib/spidr/headers.rb:4: uninitialized constant Spidr::Headers::Set

I am using rvm, with ruby 1.8.7-p330

It also failed for ruby 1.9.2.

fetch_titles not following 301

I'm using the following code to fetch titles on a site

def fetch_titles site
  Enumerator.new do |enum|
    Spidr.site(site) do |spider|
      spider.every_html_page do |page|
        enum.yield page.title
      end
    end
  end
end


fetch_titles('http://site.tld').each do |site|
  p site
end

I'm getting a lot of "301 Moved Permanently" for page.title because Spidr is requesting http://site.tld/~page instead of http://site.tld/~page/.

Is there any way to tell Spidr to append a / to the URI, or to follow 301s automatically?

undefined method `attr' for #<Nokogiri::XML::Element:0xb77423c4> (NoMethodError)

ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
Spidr 0.2.5

I kept getting this error;

/usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:537:in `meta_redirect': undefined method `attr' for #<Nokogiri::XML::Element:0xb77423c4> (NoMethodError)

The error is caused by the meta_redirect function using the wrong Nokogiri syntax.

Line 537:
node.attr('http-equiv')
replace with:
node.attributes['http-equiv']

Line 538:
node.attr('content')
replace with:
node.attributes['content']

Ryan

Spidering pages with no content-type header

We ran into a scenario where we tried to spider a customer's site for certain keywords -- keywords that were present when we viewed the site in a browser -- but could not locate any of them using some flavor of page.search('//body').text.include?(keyword). Ultimately, page.search('//body') returned an empty array because this customer's web server is not returning a Content-Type header, so the body is never parsed into HTML or XML.

What are your thoughts on attempting to parse pages which have no content-type header as HTML? This matches the behavior of current web browsers, and at first glance makes this spider more intuitive to use. I'm happy to work on it, but I may be missing a compelling reason to simply ignore such pages.
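For what it's worth, a hedged workaround sketch until that behavior changes: when a response carries no Content-Type, parse the raw body as HTML yourself instead of relying on page.doc / page.search (the site URL and keyword are placeholders).

require 'nokogiri'
require 'spidr'

keyword = 'refund policy'  # hypothetical keyword

Spidr.site('http://example.com/') do |spider|
  spider.every_page do |page|
    # Fall back to parsing the body as HTML when spidr didn't build a document.
    doc = page.doc || Nokogiri::HTML(page.body)
    puts page.url if doc.search('//body').text.include?(keyword)
  end
end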

Malformed mailto error

The Spidr gem seems to not handle malformed mailto addresses properly. I can't see this problem arising very often, but it did in my case.

  • Here is the HTML that is causing the error (notice the commas instead of dots):

<a href="mailto:user@example,org,uk">[email protected]</a>

  • The error:

/usr/lib/ruby/1.8/uri/generic.rb:732:in `merge': unrecognised opaque part for mailtoURL: user@example,org,uk (URI::InvalidComponentError)
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:509:in `to_absolute'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:495:in `urls'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:495:in `map'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:495:in `urls'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:587:in `visit_page'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:513:in `get_page'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:678:in `prepare_request'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:507:in `get_page'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:573:in `visit_page'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:244:in `run'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:226:in `start_at'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:197:in `site'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:124:in `initialize'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:194:in `new'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:194:in `site'
    from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/spidr.rb:96:in `site'

Improve network connection to HTTPS server via HTTPSProxy.

In this case, every GET request produces a CONNECT request plus a GET request, which is very slow. If you add a "session.start" in the method SessionCache#[], around line 89, you will have only one CONNECT, issued when the session is first created.

So in the file lib/spidr/session_cache.rb you would have:

 84   if url.scheme == 'https'
 85     session.use_ssl     = true
 86     session.verify_mode = OpenSSL::SSL::VERIFY_NONE
 87   end
 88
+89   session.start
 90   @sessions[key] = session
 91 end

     return @sessions[key]
   end

Regards

Limit crawl to links matching pattern

Hi,

I want to crawl a website and reduce the crawl time, so I'm trying to limit the pages crawled to only those I really need. To do so, I'd like to implement 2 rules:

  1. Limit to 10 pages max
  2. Limit to links where anchor text match a regex

Is it possible with Spidr?
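A hedged sketch for rule 1: a limit: option appears in use elsewhere on this page to cap the number of visited pages. Rule 2 (matching on anchor text) doesn't seem to be exposed by the URL-based filters, so it would likely need custom handling, e.g. inspecting each page's DOM yourself. The site URL below is a placeholder.

Spidr.site('http://example.com/', limit: 10) do |spider|
  spider.every_html_page do |page|
    puts page.title
  end
end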

Thanks a lot,
Brice

static course specs

I'm working on a patch, and trying to augment the course-related specs with some other real-world examples. However, when I copy the static/course directory to a public webserver and change the COURSE_URL in spec/helpers/course.rb to reflect that, I get a number of spec failures (16 to be exact).

Is this expected? Are there particular rewrite rules or something that aren't codified in the static directory I'd need in order to replicate this?
