postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

License: MIT License
Hi there,
I was wondering if it would be possible to multithread the spidr gem? I don't know much about multithreading in Ruby, but I believe only Ruby 1.9.x is able to do so?
I had a look through the source but couldn't find where the spidr gem makes its HTTP requests.
Maybe something like Typhoeus (http://rubygems.org/gems/typhoeus) could be used?
Thanks,
Ryan
First off, thanks for the great work on spidr! Second, I noticed that spidr dies on sites whose SSL is completely broken. For instance, https://36pizza.com causes spidr to crash with an SSL error. This can be resolved by adding OpenSSL::SSL::SSLError to the rescue clause on lib/agent.rb:684. Right now I don't have time to fork and make a pull request, so I'm monkeypatching, but I thought you should know.
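For reference, a minimal sketch of such a monkeypatch (hypothetical; it wraps Agent#get_page rather than editing the rescue clause in place):

require 'spidr'
require 'openssl'

module Spidr
  class Agent
    alias_method :original_get_page, :get_page

    # Skip pages whose SSL handshake fails instead of crashing the spider.
    def get_page(url, &block)
      original_get_page(url, &block)
    rescue OpenSSL::SSL::SSLError
      nil
    end
  end
end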
Hey! I was wondering if there is a way to return all the links found on a specific page. So far spidr has been great for crawling a whole site, but for my testing I'd like to be able to focus on one page.
Thanks
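A minimal sketch of fetching a single page (assuming Agent#get_page and Page#links behave as the stack traces elsewhere in these issues suggest):

require 'spidr'

agent = Spidr::Agent.new
page  = agent.get_page('http://example.com/some-page')

# Page#links enumerates the links found on that one page
page.links.each { |link| puts link }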
Add methods/options for filtering URLs by path.
Howdy! Just wondering if I'm implementing this right. I need to follow redirects, and there doesn't seem to be an option to toggle that, so I tried implementing it this way. It seems to work, but I would like some feedback!
Spidr.site(@url, max_depth: 2, limit: 20) do |spider|
  spider.every_redirect_page do |page|
    spider.visit_hosts << URI.parse(page.location).host
    spider.enqueue page.location
  end
end
When trying to build the gem for the edge version of this library I get the following message:
dwilliams@lists:~/src/spidr$ rake gem
(in /home/dwilliams/src/spidr)
rake aborted!
undefined method `yard_opts' for #<Hoe:0xb78fb5e4>
/home/dwilliams/src/spidr/Rakefile:14
(See full trace by running task with --trace)
and the trace is this:
dwilliams@lists:~/src/spidr$ rake --trace gem
(in /home/dwilliams/src/spidr)
rake aborted!
undefined method `yard_opts' for #<Hoe:0xb79330ac>
/home/dwilliams/src/spidr/Rakefile:14
/usr/lib/ruby/gems/1.8/gems/hoe-2.5.0/lib/hoe.rb:292:in `instance_eval'
/usr/lib/ruby/gems/1.8/gems/hoe-2.5.0/lib/hoe.rb:292:in `spec'
/home/dwilliams/src/spidr/Rakefile:9
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2383:in `load'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2383:in `raw_load_rakefile'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2017:in `load_rakefile'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2068:in `standard_exception_handling'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2016:in `load_rakefile'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2000:in `run'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:2068:in `standard_exception_handling'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/lib/rake.rb:1998:in `run'
/usr/lib/ruby/gems/1.8/gems/rake-0.8.7/bin/rake:31
/usr/bin/rake:19:in `load'
/usr/bin/rake:19
What obvious thing am I missing this time? I have noticed that I get this issue with a few of your other libraries as well.
Discussed in IRC the other day. Noting it here for posterity. Could look into using http://github.com/alexdunae/css_parser for this, although there may be a more efficient path.
parser = CssParser::Parser.new
parser.load_uri!(uri)
parser.loaded_uris
=> [uri, imported_uri_1, imported_uri_2, etc]
If a site uses HTTP Basic auth, then spidr fails.
The first page is http://user:pass@example.com, but the inner pages link to http://example.com/*.
http://user:pass@example.com has a link to http://user:pass@example.com/foo,
but spidr requests http://example.com/foo rather than http://user:pass@example.com/foo, and gets a 401 error.
Why?
I get this error while spidering http://apache.org.
The URL that breaks the spider seems to be this one:
www.mirrorservice.org/sites/ftp.apache.org/ (it looks as though there's a domain, when in fact it's a path)
/usr/lib/ruby/1.8/uri/generic.rb:475:in `check_path': bad component(expected absolute path component): sites/ftp.apache.org/ (URI::InvalidComponentError)
from /usr/lib/ruby/1.8/uri/generic.rb:495:in `path='
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:537:in `to_absolute'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `urls'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `map'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/page.rb:514:in `urls'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:587:in `visit_page'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:513:in `get_page'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:678:in `prepare_request'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:507:in `get_page'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:573:in `visit_page'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:244:in `run'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:226:in `start_at'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:197:in `site'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:124:in `initialize'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:194:in `new'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/agent.rb:194:in `site'
from /var/lib/gems/1.8/gems/spidr-0.2.7/lib/spidr/spidr.rb:96:in `site'
I will do some further investigating and see if I can come up with a fix.
ryan
Hey postmodern,
It seems like the usage of File.expand_path in Spidr::Page#to_absolute can goof up URLs in a very minor way. Observe:
irb(main):001:0> a = '/somedir/'
=> "/somedir/"
irb(main):002:0> File.expand_path(a)
=> "/somedir"
Imagine that you go to a site 'http://www.foo.com/somedir' ... '/somedir' is a directory, and the server responds with:
HTTP/1.1 301 Moved Permanently
Date: Mon, 21 Sep 2009 23:08:19 GMT
Server: Apache/2.0.63 (CentOS)
Location: http://www.foo.com/somedir/
.....
Requesting 'http://www.foo.com/somedir/' yields
HTTP/1.1 200 OK
Date: Mon, 21 Sep 2009 23:10:32 GMT
Server: Apache/2.0.63 (CentOS)
.....
When to_absolute normalizes 'http://www.foo.com/somedir/', it ends up coming out of the method as 'http://www.foo.com/somedir', which it has already visited.
In the real world 'http://www.foo.com/somedir' != 'http://www.foo.com/somedir/' ... File.expand_path doesn't know the difference between the two, but to an HTTP server they are two different things.
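A minimal sketch of a helper that would preserve the distinction (hypothetical, not Spidr's actual code):

# Expand '..' and '.' segments but keep the trailing slash if one was present.
def expand_path_preserving_slash(path)
  expanded = File.expand_path(path)
  expanded << '/' if path.end_with?('/') && !expanded.end_with?('/')
  expanded
end

expand_path_preserving_slash('/somedir/')  # => "/somedir/"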
~Mike
Just noticed unusual exceptions coming from Net::HTTP when running Spidr on the WSOC under Ruby 1.8.7-p249.
NoMethodError in 'Spidr::Agent before(:all)'
undefined method `closed?' for nil:NilClass
/usr/lib64/ruby/1.8/net/http.rb:1060:in `request'
/usr/lib64/ruby/1.8/net/http.rb:772:in `get'
/underground/code/spidr/spec/helpers/wsoc.rb:69:in `run_course'
./spec/agent_spec.rb:10:
Hi,
I think there is a problem with every_html_page: it unnecessarily processes JavaScript files in my app.
Thanks in advance!
Automatically detecting and parsing /sitemap.xml might be a good way to cut down on spidering depth.
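A rough sketch of manual sitemap seeding in the meantime (assumes Nokogiri is available, and that Agent#enqueue accepts URL strings as in the redirect workaround elsewhere in these issues; example.com is a placeholder):

require 'net/http'
require 'nokogiri'
require 'spidr'

Spidr.site('http://example.com/') do |spider|
  # Fetch the sitemap and queue every listed URL up front.
  xml = Nokogiri::XML(Net::HTTP.get(URI('http://example.com/sitemap.xml')))
  xml.remove_namespaces!
  xml.search('//loc').each { |loc| spider.enqueue(loc.text) }
end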
Hi,
I am sure I saw somewhere that 'start_at' accepted a 'depth' parameter, but I can't seem to find any reference to it any longer. Either way, if I wasn't just imagining things, the depth option doesn't seem to be working. I grepped through the code and couldn't find it either.
Here is a pastie of a test spider of my blog:
http://pastie.org/1754398
Is there a depth option? If so, am I using it wrong? or is it not working?
Thanks a million for any help. :)
Thanks for an awesome gem!
When crawling a site, this exception was raised:
URI::InvalidURIError: path conflicts with opaque
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:761:in `check_path'
from $HOME/.rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/generic.rb:817:in `path='
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:283:in `to_absolute'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:182:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `upto'
from $HOME/.gem/ruby/2.4.0/gems/nokogiri-1.7.0.1/lib/nokogiri/xml/node_set.rb:186:in `each'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:238:in `each_url'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:605:in `block in get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:599:in `get_page'
from $HOME/.gem/ruby/2.4.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'
The line in spidr that raises this error is lib/spidr/page/html.rb:283:

new_url.path = URI.expand_path(path)

URI#path= calls URI#check_path, which raises the error; see the Ruby docs for URI::Generic#check_path. I'm not really sure what the best way to go about this would be. Perhaps catching URI::InvalidURIError and returning nil is sensible, since nil can already be returned from Page#to_absolute?
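A sketch of that approach as a standalone helper (hypothetical, not the gem's code):

require 'spidr'  # spidr itself defines URI.expand_path

# Returns url with its path expanded, or nil when the result is invalid.
def safe_expand_path(url, path)
  new_url = url.dup
  new_url.path = URI.expand_path(path)
  new_url
rescue URI::InvalidURIError
  nil  # nil can already be returned from Page#to_absolute
end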
The documentation says that it is possible to skip processing some pages, but I cannot find how to do it. I have tried ignore_links and ignore_pages, but nothing seems to work, e.g.:
spider = Spidr.site('.....', ignore_links: [%{^/blog/}]) do |spider|
  spider.every_html_page do |page|
    # here I still get pages with the /blog URL
  end
end
How can I ignore some pages based on the URL?
I guess I could load the robots.txt file of a site, but is there a way to turn this on so that it will always follow the rules?
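A minimal sketch, assuming the robots: option that newer Spidr versions advertise (an assumption about the installed version's API):

require 'spidr'

# Honors the site's robots.txt rules while spidering.
Spidr.site('http://example.com/', robots: true) do |spider|
  spider.every_url { |url| puts url }
end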
Hello. I am loving this library! But I have an issue.
I am collecting the URLs of already-scraped pages in an array, and to resume the process later I am using ignore_links to skip them.
However, it's not working. The URLs are collected via page.url and are later fed into ignore_links as absolute URL strings. The page I am scraping references its content with relative links.
linkregs = [] # regexes, working fine
ignore   = [] # read from file

Spidr.start_at("http://example.com", links: linkregs, ignore_links: ignore) do |spidr|
  spidr.every_page do |page|
    if ignore.include?(page.url.to_s)
      # this is the problem
      puts "Error!!"
    end

    ignore.push(page.url.to_s)
  end
end
# save ignore to file
I'm trying to crawl Stack Overflow, but the crawler keeps giving me this error. Apparently the problem happens whenever it reaches the following link, and I'm not sure how to fix it:
"subject=Stack%20Overflow%20Question&body=Time%20series%20speed%20forecasting%20using%20regression%20with%20exogenous%20variables%0Ahttps%3a%2f%2fstackoverflow.com%2fq%2f49618734%3fsem%3d2"
Traceback (most recent call last):
21: from main.rb:4:in `start_crawling'
19: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/spidr.rb:53:in `site'
17: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:355:in `start_at'
15: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'
13: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
11: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
9: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
7: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:189:in `upto'
5: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
3: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
1: from /usr/local/Cellar/ruby/2.5.0_2/lib/ruby/2.5.0/uri/generic.rb:822:in `path='
/usr/local/Cellar/ruby/2.5.0_2/lib/ruby/2.5.0/uri/generic.rb:in `check_path': path conflicts with opaque (URI::InvalidURIError)
In the Examples section, I've got the 'Do not spider certain links' operation to work.
In my case, Spidr.site('http://www.parkers.co.uk/', :ignore_links => [/vans/]) correctly displays all URLs except those starting with /vans/.
However, I really need to also 'ignore' all links EXCEPT for /vans/, so that ONLY /vans/ urls are displayed.
Is this possible?
There are too many possibilities to just add to the list of 'ignore_links', I really need to 'ignore everything except' for /vans/.
thanks!
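A minimal sketch, assuming the links: option (used as an acceptance list in another issue in this document) acts as a whitelist:

Spidr.site('http://www.parkers.co.uk/', links: [/vans/]) do |spider|
  spider.every_url { |url| puts url }
end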
I'm opening this issue for the sole reason of saying: thank you so much for your hard work.
Also, as a side note: I ran the specs against Ruby 2.5.3 using the latest RubyGems 3 and Bundler 2, with minor updates to the dependencies, and all the specs were green.
To reduce lookup time in Spidr::Agent#queue, we can store the URLs in a Hash keyed by the unique host:port pair, mapping to the URL paths. This will also facilitate events for when new hosts are seen.
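A minimal sketch of the proposed structure (hypothetical, not the current queue implementation):

require 'uri'

# Map each unique host:port pair to the queued paths for that host.
queue = Hash.new { |hash, key| hash[key] = [] }

def enqueue(queue, url)
  url = URI(url.to_s)
  key = "#{url.host}:#{url.port}"
  new_host = !queue.key?(key)  # true the first time this host:port is seen
  queue[key] << url.request_uri
  new_host                     # could be used to fire a "new host" event
end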
Is it possible to give Spidr some setup step, like visiting a page and logging in, before spidering the remaining pages?
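Not built in as far as I can tell, but a rough sketch of one workaround, relying on the cookie-saving behavior visible in the get_page source quoted later in these issues (GET-based logins only; the URLs and credentials are placeholders):

agent = Spidr::Agent.new

# Fetch the login URL first so the session cookie lands in the agent's
# cookie jar, then spider the authenticated area.
agent.get_page('http://example.com/login?user=alice&pass=secret')
agent.start_at('http://example.com/members/')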
Rewrite the module as a shared_example and do not use eval.
In agent.rb:483 it seems like you are passing path and headers to Net::HTTP.get(), but according to the documentation (http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M000655) this method only takes host, path, and port.
I am trying to add the additional client header "Host", but am obviously missing something in your implementation. Can you clarify?
The use case is having the host be an IP address, but the Host header being the domain name, allowing spidering of sites that aren't in DNS but are hosted through a vhost.
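For what it's worth, the general technique with plain Net::HTTP looks like this (a sketch; the IP and hostname are placeholders, and this is not Spidr's API):

require 'net/http'

# Connect to the IP address, but present the vhost's name in the Host header.
http = Net::HTTP.new('192.0.2.10', 80)
request = Net::HTTP::Get.new('/')
request['Host'] = 'www.example.com'
response = http.request(request)
puts response.code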
Hi,
It seems that when $_SERVER['REQUEST_URI'] or similar is used AND the web server is configured to return custom error pages (including 200 statuses), Spidr ends up in an infinite loop.
In this particular case the problem URL is in a POST form action element, but I don't think it matters where the URL appears.
Eventually ends up with pages like so:
I'm not sure how this could be solved, the depth option may help cut down on the false positive URLs but wouldn't solve the problem.
Thanks,
Ryan
Hello, when I'm crawling quite big websites (with :depth => 5), spidr eats all my server's resources. I'm using it in many threads (within EventMachine), so I think it could be a thread-safety problem. Any ideas?
I've just run into a situation where the reuse of an SSL session caused an exception and Spidr subsequently skipped the page. Currently, the exception is silently swallowed, so I modified it to grab the following trace:
EOFError (end of file reached):
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/openssl/buffering.rb:174:in `sysread_nonblock'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/openssl/buffering.rb:174:in `read_nonblock'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `rbuf_fill'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:122:in `readuntil'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:132:in `readline'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:2562:in `read_status_line'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:2551:in `read_new'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1319:in `block in transport_request'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1316:in `catch'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1316:in `transport_request'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1293:in `request'
rest-client (1.6.7) lib/restclient/net_http_ext.rb:51:in `request'
/home/nirvdrum/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/http.rb:1026:in `get'
spidr (0.4.1) lib/spidr/agent.rb:513:in `block in get_page'
spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
app/models/cookie_login_option.rb:150:in `fetch_remote_form'
app/models/cookie_login_option.rb:158:in `block in fetch_remote_form'
spidr (0.4.1) lib/spidr/agent.rb:518:in `block in get_page'
spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
app/models/cookie_login_option.rb:150:in `fetch_remote_form'
app/models/cookie_login_option.rb:158:in `block in fetch_remote_form'
spidr (0.4.1) lib/spidr/agent.rb:518:in `block in get_page'
spidr (0.4.1) lib/spidr/agent.rb:684:in `prepare_request'
spidr (0.4.1) lib/spidr/agent.rb:512:in `get_page'
If I modify the code to remove the session cache, I am able to fetch the page okay. It might be good to catch EOFError and retry with a new session in the event this happens. Catching the error all over the place could be messy though.
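A rough sketch of that retry, as it might look inside the request path (hypothetical; kill! and [] are assumptions about SessionCache's interface):

begin
  response = session.get(path, headers)
rescue EOFError
  # The cached SSL session went stale; evict it and retry once
  # with a freshly created session.
  @sessions.kill!(url)
  session  = @sessions[url]
  response = session.get(path, headers)
end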
Make wsoc a dependency of spidr and route all requests to it using webmock's .to_rack method. https://robots.thoughtbot.com/how-to-stub-external-services-in-tests
Cool gem!
I'm trying to ignore /partners and everything after it on my site (www.mysite.com/partners/resellers), but Spidr is still visiting those links.
root = args[:url]
url_map = Hash.new { |hash,key| hash[key] = [] }

spider = Spidr.site(root, ignore_links_like: [%{^/partners/}]) do |spider|
  spider.every_url { |url| puts url }
  spider.every_failed_url { |url| puts "Failed url #{url}" }

  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"
  url_map[url].each { |page| puts "  #{page}".red }
end
Hi.
I got the following error while spidering a site. I suspect it was because the remote site dropped the connection, however I am unsure.
/usr/lib/ruby/1.8/net/http.rb:1060:in `request'
/usr/lib/ruby/1.8/net/http.rb:772:in `get'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:521:in `get_page'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:693:in `prepare_request'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:520:in `get_page'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:586:in `visit_page'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:256:in `run'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:238:in `start_at'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:209:in `site'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:136:in `initialize'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:206:in `new'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/agent.rb:206:in `site'
/usr/lib/ruby/gems/1.8/gems/spidr-0.3.1/lib/spidr/spidr.rb:96:in `site'
I was thinking that a possible solution would be to wrap the contents of the get_page method in agent.rb in a begin/rescue block, report the error, but carry on, or maybe try again?
def get_page(url)
  url = URI(url.to_s)

  begin
    prepare_request(url) do |session,path,headers|
      new_page = Page.new(url,session.get(path,headers))

      # save any new cookies
      @cookies.from_page(new_page)

      yield new_page if block_given?
      return new_page
    end
  rescue => e
    puts '+++++ ERROR IN SPIDR GEM ' + e.inspect
    return ''
  end
end
A couple of things
Noticed this exception from within Net::HTTP when requesting the last page:
NoMethodError: undefined method `closed?' for nil:NilClass
from /usr/lib64/ruby/1.8/net/http.rb:1060:in `request'
from /usr/lib64/ruby/1.8/net/http.rb:772:in `get'
from ./lib/spidr/agent.rb:501
from ./lib/spidr/agent.rb:671:in `call'
from ./lib/spidr/agent.rb:671:in `prepare_request'
from ./lib/spidr/agent.rb:500:in `get_page'
from ./lib/spidr/agent.rb:566:in `visit_page'
from ./lib/spidr/agent.rb:244:in `run'
from ./lib/spidr/agent.rb:226:in `start_at'
from ./lib/spidr/agent.rb:171
from ./lib/spidr/agent.rb:124:in `call'
from ./lib/spidr/agent.rb:124:in `initialize'
from ./lib/spidr/agent.rb:168:in `new'
from ./lib/spidr/agent.rb:168:in `host'
from ./lib/spidr/spidr.rb:89:in `host'
from (irb):2
Found an edge-case in URI.expand_path: /../foo expands to just foo, which is not an absolute path.
Hi,
This is probably more of a feature request than a bug.
If a page sets more than one cookie:

<?php
if (!isset($_COOKIE["TestCookie"])) {
    setcookie("TestCookie", "value");
}
if (!isset($_COOKIE["TestCookie1"])) {
    setcookie("TestCookie1", "value");
}

echo $_COOKIE["TestCookie"];
echo $_COOKIE["TestCookie1"];
?>

Spidr returns:
puts page.cookies.to_s # => TestCookie=valueTestCookie1=value
There is no easy way to distinguish between TestCookie and TestCookie1. It would be better if page.cookies output 'TestCookie=value;TestCookie1=value'; that way each cookie could be identified with a simple split.
Is there any way I can implement this fix now for my own use?
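A possible workaround sketch, assuming Page#headers returns the raw response headers with one array element per Set-Cookie line:

# Split each Set-Cookie header individually instead of using the
# concatenated page.cookies string.
Array(page.headers['set-cookie']).each do |header|
  name, value = header.split(';', 2).first.split('=', 2)
  puts "#{name}=#{value}"
end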
Ryan
Howdy. We just had a big debugging session centered around redirects, and it turned out that the site was redirecting from a non-www URL to a www.domain URL, so spidr silently failed, having found that it was a different host (which is true, but not intuitive).
May we suggest an option to pass in a logger and log level? Then you could log "page (url) prevented by host filter". That would be super duper nice.
The following failed when just trying to require 'spidr'
ruby-1.8.7-p330@spidr/gems/spidr-0.3.0/lib/spidr/headers.rb:4: uninitialized constant Spidr::Headers::Set
I am using rvm, with ruby 1.8.7-p330
It also failed for ruby 1.9.2.
Hi there,
There is a problem with trailing slashes in URLs; e.g., http://foo.bar.com and http://foo.bar.com/ are unnecessarily treated as different URLs.
Cheers!
I have a site to spider - https://www.logility.com - but it's failing on:
ruby/2.2.0/net/http/response.rb:377:in `inflate': incorrect header check (Zlib::DataError)
If I set Accept-Encoding: plain it apparently works (it then works via open-uri anyway).
I'm using the following code to fetch the titles on a site:

def fetch_titles(site)
  Enumerator.new do |enum|
    Spidr.site(site) do |spider|
      spider.every_html_page do |page|
        enum.yield page.title
      end
    end
  end
end

fetch_titles('http://site.tld').each do |site|
  p site
end
I'm getting a lot of "301 Moved Permanently" responses for page.title, because Spidr is requesting http://site.tld/~page instead of http://site.tld/~page/.
Is there any way to tell Spidr to append a / to the URI, or to follow 301s automatically?
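A workaround sketch along the lines of the redirect-enqueue approach shown in an earlier issue here, dropped into fetch_titles (assumes Page#location holds the redirect target):

Spidr.site(site) do |spider|
  # Manually follow the Location header of each redirect.
  spider.every_redirect_page do |page|
    spider.enqueue(page.location)
  end

  spider.every_html_page do |page|
    enum.yield page.title
  end
end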
ruby 1.8.7 (2010-01-10 patchlevel 249) [i486-linux]
Spidr 0.2.5
I kept getting this error:
/usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:537:in `meta_redirect': undefined method `attr' for #<Nokogiri::XML::Element:0xb77423c4> (NoMethodError)
The error is caused by the meta_redirect function using the wrong Nokogiri syntax.
Line 537:
node.attr('http-equiv')
should be replaced with:
node.attributes['http-equiv']
Line 538:
node.attr('content')
should be replaced with:
node.attributes['content']
Ryan
We ran into a scenario where we tried to spider a customer's site for certain keywords (keywords that were present when we viewed the site in a browser), but could not locate any of them using some flavor of page.search('//body').text.include?(keyword). Ultimately, page.search('//body') returned an empty array, because this customer's web server is not returning a content-type header, so the page is not being parsed as HTML or XML.
What are your thoughts on attempting to parse pages which have no content-type header as HTML? This matches the behavior of current web browsers, and at first glance makes the spider more intuitive to use. I'm happy to work on it, but I may be missing a compelling reason to simply ignore such pages.
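In the meantime, a workaround sketch for inside an every_html_page block (assumes Page#body returns the raw response body):

require 'nokogiri'

# Force-parse the body as HTML, regardless of the content-type header.
doc = Nokogiri::HTML(page.body)
puts doc.search('//body').text.include?(keyword)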
The Spidr gem seems to not handle malformed mailto addresses properly. I can't see this problem arising very often, but it did in my case.
<a href="mailto:user@example,org,uk">user@example,org,uk</a>
/usr/lib/ruby/1.8/uri/generic.rb:732:in `merge': unrecognised opaque part for mailtoURL: user@example,org,uk (URI::InvalidComponentError)
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:509:in `to_absolute'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:495:in `urls'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:495:in `map'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/page.rb:495:in `urls'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:587:in `visit_page'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:513:in `get_page'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:678:in `prepare_request'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:507:in `get_page'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:573:in `visit_page'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:244:in `run'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:226:in `start_at'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:197:in `site'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:124:in `initialize'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:194:in `new'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/agent.rb:194:in `site'
from /usr/lib/ruby/gems/1.8/gems/spidr-0.2.5/lib/spidr/spidr.rb:96:in `site'
I only want to display part of a spidered URL; is this possible?
I've spidered the website www.parkers.co.uk and got the correct URLs, but I don't want to display the www.parkers.co.uk part of each URL.
For example, instead of;
http://www.parkers.co.uk/cars/prices/
http://www.parkers.co.uk/cars/leasing/
http://www.parkers.co.uk/vans/for-sale/
I just want to display;
/cars/prices/
/cars/leasing/
/vans/for-sale/
thanks!
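A minimal sketch (every_url yields URI objects, as other issues here show, so URI#path should do it):

Spidr.site('http://www.parkers.co.uk/') do |spider|
  spider.every_url { |url| puts url.path }
end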
In this case, every GET request produces a CONNECT request and a GET request, which is very slow.
If you add a "session.start" in the method SessionCache#[], around line 89, you will have only one CONNECT, issued when the session is created.
So in the file lib/spidr/session_cache.rb you will find:
 84   if url.scheme == 'https'
 85     session.use_ssl = true
 86     session.verify_mode = OpenSSL::SSL::VERIFY_NONE
 87   end
 88
+89   session.start
 90   @sessions[key] = session
 91 end

    return @sessions[key]
  end
Regards
Hi,
I want to crawl a website and reduce the time it takes, so I'm trying to limit the pages crawled to only those I really need. To do so, I'd like to implement 2 rules:
Is it possible with Spidr?
Thanks a lot,
Brice
How do I log in by submitting a form?
I'm working on a patch, and trying to augment the course-related specs with some other real-world examples. However, when I copy the static/course directory to a public webserver and change the COURSE_URL in spec/helpers/course.rb to reflect that, I get a number of spec failures (16 to be exact).
Is this expected? Are there particular rewrite rules or something that aren't codified in the static directory I'd need in order to replicate this?
Add get, head, post, put, etc. methods to Spidr::Agent, for when you do not want a Page object returned, just the raw response.