gurgeous / sinew
License: MIT License
A Ruby DSL for structured web crawling, with a robust caching system.
RubyGems.org doesn't report a license for your gem. This is because it is not specified in the gemspec of your last release.
You can specify one in the gemspec, e.g.
spec.license = 'MIT'
# or
spec.licenses = ['MIT', 'GPL-2']
Including a license in your gemspec is an easy way for rubygems.org and other tools to check how your gem is licensed. As you can imagine, scanning your repository for a LICENSE file or parsing the README, and then attempting to identify the license or licenses is much more difficult and more error prone. So, even for projects that already specify a license, including a license in your gemspec is a good practice. See, for example, how rubygems.org uses the gemspec to display the rails gem license.
There is even a License Finder gem to help companies/individuals ensure all gems they use meet their licensing needs. This tool depends on license information being available in the gemspec. This is an important enough issue that even Bundler now generates gems with a default 'MIT' license.
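For context, here is a minimal sketch of where the license line sits in a complete gemspec. The gem name, version, summary, and author below are hypothetical placeholders, not Sinew's real metadata. Note that newer RubyGems versions check license strings against SPDX identifiers, so `MIT` is accepted while a string such as `GPL-2` may produce a warning at build time.

```ruby
# Hypothetical gemspec contents; only the license line is the point here.
spec = Gem::Specification.new do |s|
  s.name    = "example_gem"          # placeholder name
  s.version = "0.1.0"                # placeholder version
  s.summary = "Placeholder summary"  # placeholder summary
  s.authors = ["Example Author"]     # placeholder author

  # A single license; use s.licenses = [...] for more than one.
  s.license = "MIT"
end

# The singular setter fills the plural list that tools read.
puts spec.licenses.inspect
```

Tools such as rubygems.org read the `licenses` list from the built gem, so nothing beyond this one line should be needed.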
I hope you'll consider specifying a license in your gemspec. If not, please just close the issue with a nice message. In either case, I'll follow up. Thanks for your time!
Appendix:
If you need help choosing a license (sorry, I haven't checked your readme or looked for a license file), GitHub has created a license picker tool. Code without a license specified defaults to 'All rights reserved', denying others all rights to use the code.
Here's a list of the license names I've found and their frequencies
p.s. In case you're wondering how I found you and why I made this issue, it's because I'm collecting stats on gems (I was originally looking for download data) and decided to collect license metadata, too, and make issues for gemspecs not specifying a license as a public service :). See the previous link or my blog post about this project for more information.
Hi! Sinew's approach to scraping is exactly what I'm after. Trouble is, after installing the gem and running the Amazon example, the following error was returned. Have you seen this kind of thing before? Personally, I have not, and Google's results for "superclass mismatch for class DateTime" are uncharacteristically sparse, as far as error-searching goes.
/home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:106:in `require': superclass mismatch for class DateTime (TypeError)
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:106:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/date.rb:3:in `<top (required)>'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:106:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:106:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/activesupport-4.0.0/lib/active_support/core_ext/string/conversions.rb:1:in `<top (required)>'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:58:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:58:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/activesupport-4.0.0/lib/active_support/core_ext/string.rb:1:in `<top (required)>'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:106:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:106:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/activesupport-4.0.0/lib/active_support/core_ext.rb:3:in `block in <top (required)>'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/activesupport-4.0.0/lib/active_support/core_ext.rb:1:in `each'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/activesupport-4.0.0/lib/active_support/core_ext.rb:1:in `<top (required)>'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:106:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:106:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/sinew-1.0.3/lib/sinew/text_util.rb:1:in `<top (required)>'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:58:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:58:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/sinew-1.0.3/lib/sinew.rb:5:in `<top (required)>'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:58:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:58:in `require'
from /home/u/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/sinew-1.0.3/bin/sinew:3:in `<top (required)>'
from /home/u/.rbenv/versions/2.0.0-p247/bin/sinew:23:in `load'
from /home/u/.rbenv/versions/2.0.0-p247/bin/sinew:23:in `<main>'
which: no tidy in (/usr/local/rvm/gems/ruby-1.9.2-p320/bin:/usr/local/rvm/gems/ruby-1.9.2-p320@global/bin:/usr/local/rvm/rubies/ruby-1.9.2-p320/bin:/usr/local/rvm/bin:/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin)
[17:57:30] Sorry, Sinew requires tidy. Please install it.
I have installed tidy via gem install tidy.
Please help.
So I talked to the Nokogiri people and it turns out that they have problems getting HTML5 support out of the underlying libxml. The only reliable way right now is to use nokogumbo, which uses Google's gumbo instead. Is there anything that would speak against changing sinew to use that?
My guess is that there is some setup step that you did a long time ago that you forgot to include in the readme.
I describe the problem here: http://stackoverflow.com/questions/11074718/sinew-ruby-web-scraper-example-does-not-work-on-my-machine
Thanks.
It should be possible to use Sinew in other projects as a normal Ruby class. It is common to use a crawler inside another project, and there should be a way to call it directly.
Currently Sinew is shipped only as a binary.
Hi, first off, thanks for a great gem! I'm trying to do some work with Arabic websites and was already able to put the basics together following the docs and examples. Unfortunately, the CSV file contains no Arabic characters at all. E.g., the name of the U.S. Secretary of State John Kerry, in Arabic "كيري", shows up as "kyry", which is entirely unusable for most scientific purposes. Apparently Sinew got the text off the website just fine, but it's applying some sort of unwanted transliteration to it. Could you give me a hint as to how I could have the original Arabic text (in UTF-8 if at all possible) written to the CSV instead?
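As a point of comparison, the sketch below shows that Ruby's standard CSV library can write Arabic text as UTF-8 unchanged. It does not use Sinew and says nothing about where Sinew's transliteration happens; the row contents and the URL path are made-up illustration values, with only the Arabic string taken from the report above.

```ruby
require "csv"

# Hypothetical scraped rows; only the Arabic name comes from the report.
rows = [
  { name: "كيري", url: "http://www.domain.tld/kerry" }
]

# Build the CSV in memory, asking for UTF-8 output explicitly.
csv_text = CSV.generate(encoding: "UTF-8") do |csv|
  csv << %w[name url]                          # header row
  rows.each { |r| csv << [r[:name], r[:url]] } # data rows, written as-is
end

puts csv_text
```

If output produced this way keeps the Arabic characters, the transliteration is presumably applied somewhere before the CSV is written, which would be the place to look for a switch or a patch.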
Trying to do something that might not have been part of the plan:
def get_article_body(url)
  get url
  noko.at_css('article').text
end

def get_list_of_articles(url)
  get url
  noko.css('.articles').each do |div|
    path = div.at_css('.header a')[:href]
    row[:url] = path
    row[:body] = get_article_body("#{url}/#{path}")
    csv_emit(row)
  end
end

get_list_of_articles("http://www.domain.tld")
Unfortunately only the first call to get works; after that nothing seems to happen anymore. Am I going about this wrong or is it just not supported?