
MetaInspector


MetaInspector is a gem for web scraping purposes.

You give it a URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...

Installation

Install the gem from RubyGems:

gem install metainspector

If you're using it in a Rails application, add it to your Gemfile and run bundle install:

gem 'metainspector'

Supported Ruby versions are defined in .circleci/config.yml.

Usage

Initialize a MetaInspector instance for a URL, like this:

page = MetaInspector.new('http://sitevalidator.com')

If you don't include the scheme on the URL, http:// will be used by default:

page = MetaInspector.new('sitevalidator.com')

You can also pass in the HTML directly, which will be used as the document to scrape:

page = MetaInspector.new("http://sitevalidator.com",
                         :document => "<html>...</html>")

Accessing response

You can check the status and headers from the response like this:

page.response.status  # 200
page.response.headers # { "server"=>"nginx", "content-type"=>"text/html; charset=utf-8",
                      #   "cache-control"=>"must-revalidate, private, max-age=0", ... }

Accessing scraped data

URL

page.url                 # URL of the page
page.tracked?            # returns true if the url contains known tracking parameters
page.untracked_url       # returns the url with the known tracking parameters removed
page.untrack!            # removes the known tracking parameters from the url
page.scheme              # Scheme of the page (http, https)
page.host                # Hostname of the page (like, sitevalidator.com, without the scheme)
page.root_url            # Root url (scheme + host, like http://sitevalidator.com/)
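As a rough illustration of what these accessors correspond to, here is the same decomposition done with Ruby's stdlib URI. This is only a sketch: MetaInspector itself does more (such as URL normalization via Addressable).

```ruby
require 'uri'

# Illustrative only: decomposing a URL with Ruby's stdlib URI.
uri = URI.parse('http://sitevalidator.com/plans-and-pricing')

scheme   = uri.scheme                     # "http"
host     = uri.host                       # "sitevalidator.com"
root_url = "#{uri.scheme}://#{uri.host}/" # "http://sitevalidator.com/"
```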

Head links

page.head_links          # an array of hashes of all head/links
page.stylesheets         # an array of hashes of all head/links where rel='stylesheet'
page.canonicals          # an array of hashes of all head/links where rel='canonical'
page.feeds               # RSS or Atom feed links found in the head section, as an array of hashes in the form { href: "...", title: "...", type: "..." }
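Since each element of page.feeds is a hash with :href, :title and :type, picking out, say, the first Atom feed is straightforward. A sketch with made-up feed data:

```ruby
# Made-up example of the hashes page.feeds returns.
feeds = [
  { href: 'http://example.com/rss.xml',  title: 'RSS feed',  type: 'application/rss+xml' },
  { href: 'http://example.com/atom.xml', title: 'Atom feed', type: 'application/atom+xml' }
]

atom = feeds.find { |f| f[:type] == 'application/atom+xml' }
atom[:href] # => "http://example.com/atom.xml"
```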

Texts

page.title               # title of the page from the head section, as string
page.best_title          # best title of the page, from a selection of candidates
page.author              # author of the page from the meta author tag
page.best_author         # best author of the page, from a selection of candidates
page.description         # returns the meta description
page.best_description    # returns the first non-empty description between the following candidates: standard meta description, og:description, twitter:description, the first long paragraph
page.h1                  # returns h1 text array
page.h2                  # returns h2 text array
page.h3                  # returns h3 text array
page.h4                  # returns h4 text array
page.h5                  # returns h5 text array
page.h6                  # returns h6 text array
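The selection behind best_description is essentially "first non-empty candidate wins". A sketch of that logic, with made-up candidate values:

```ruby
# Candidates in priority order; nil and empty values are skipped.
candidates = [
  nil,                           # standard meta description (missing)
  '',                            # og:description (empty)
  'A description from Twitter.', # twitter:description
  'The first long paragraph of the page.'
]

best = candidates.find { |c| c && !c.strip.empty? }
best # => "A description from Twitter."
```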

Links

page.links.raw           # every link found, unprocessed
page.links.all           # every link found on the page as an absolute URL
page.links.http          # every HTTP link found
page.links.non_http      # every non-HTTP link found
page.links.internal      # every internal link found on the page as an absolute URL
page.links.external      # every external link found on the page as an absolute URL
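The split between these collections can be sketched like this, assuming the links are already absolute URLs (as page.links.all returns them). The example URLs are made up, and this is not the gem's internal implementation:

```ruby
require 'uri'

page_host = 'example.com'
links = [
  'http://example.com/faqs',
  'https://github.com/jaimeiniesta/metainspector',
  'mailto:hello@example.com'
]

# HTTP vs non-HTTP, then internal vs external by host.
http_links     = links.select { |l| l.start_with?('http://', 'https://') }
non_http_links = links - http_links
internal, external = http_links.partition { |l| URI.parse(l).host == page_host }

internal # => ["http://example.com/faqs"]
external # => ["https://github.com/jaimeiniesta/metainspector"]
```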

Images

page.images              # enumerable collection, with every img found on the page as an absolute URL
page.images.with_size    # a sorted array (by descending area) of [image_url, width, height]
page.images.best         # most relevant image, if defined with the og:image or twitter:image meta tags; falls back to the first element of page.images
page.images.favicon      # absolute URL to the favicon
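The ordering used by with_size is by descending area (width × height); that sort can be sketched like this, with made-up image data:

```ruby
# [url, width, height] triples, in the shape with_size returns them.
images = [
  ['http://example.com/logo.png',   100, 100],
  ['http://example.com/header.jpg', 300, 200],
  ['http://example.com/tall.png',    50, 400]
]

sorted = images.sort_by { |_url, width, height| -(width * height) }
sorted.first # => ["http://example.com/header.jpg", 300, 200]
```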

Meta tags

When it comes to meta tags, you have several options:

page.meta_tags  # Gives you all the meta tags by type:
                # (meta name, meta http-equiv, meta property and meta charset)
                # As meta tags can be repeated (in the case of 'og:image', for example),
                # the values returned will be arrays
                #
                # For example:
                #
                # {
                #   'name' => {
                #     'keywords'       => ['one, two, three'],
                #     'description'    => ['the description'],
                #     'author'         => ['Joe Sample'],
                #     'robots'         => ['index,follow'],
                #     'revisit'        => ['15 days'],
                #     'dc.date.issued' => ['2011-09-15']
                #   },
                #
                #   'http-equiv' => {
                #     'content-type'       => ['text/html; charset=UTF-8'],
                #     'content-style-type' => ['text/css']
                #   },
                #
                #   'property' => {
                #     'og:title'        => ['An OG title'],
                #     'og:type'         => ['website'],
                #     'og:url'          => ['http://example.com/meta-tags'],
                #     'og:image'        => ['http://example.com/rock.jpg',
                #                           'http://example.com/rock2.jpg',
                #                           'http://example.com/rock3.jpg'],
                #     'og:image:width'  => ['300'],
                #     'og:image:height' => ['300', '1000']
                #   },
                #
                #   'charset' => ['UTF-8']
                # }

As this method returns a hash, you can also take just the key you need:

page.meta_tags['property']  # Returns:
                            # {
                            #   'og:title'        => ['An OG title'],
                            #   'og:type'         => ['website'],
                            #   'og:url'          => ['http://example.com/meta-tags'],
                            #   'og:image'        => ['http://example.com/rock.jpg',
                            #                         'http://example.com/rock2.jpg',
                            #                         'http://example.com/rock3.jpg'],
                            #   'og:image:width'  => ['300'],
                            #   'og:image:height' => ['300', '1000']
                            # }

In most cases you will only be interested in the first occurrence of a meta tag, so you can use the singular form of that method:

page.meta_tag['name']   # Returns:
                        # {
                        #   'keywords'       => 'one, two, three',
                        #   'description'    => 'the description',
                        #   'author'         => 'Joe Sample',
                        #   'robots'         => 'index,follow',
                        #   'revisit'        => '15 days',
                        #   'dc.date.issued' => '2011-09-15'
                        # }

Or, as this is also a hash:

page.meta_tag['name']['keywords']    # Returns 'one, two, three'

And finally, you can use the shorter meta method that will merge the different keys so you have a simpler hash:

page.meta   # Returns:
            #
            # {
            #   'keywords'            => 'one, two, three',
            #   'description'         => 'the description',
            #   'author'              => 'Joe Sample',
            #   'robots'              => 'index,follow',
            #   'revisit'             => '15 days',
            #   'dc.date.issued'      => '2011-09-15',
            #   'content-type'        => 'text/html; charset=UTF-8',
            #   'content-style-type'  => 'text/css',
            #   'og:title'            => 'An OG title',
            #   'og:type'             => 'website',
            #   'og:url'              => 'http://example.com/meta-tags',
            #   'og:image'            => 'http://example.com/rock.jpg',
            #   'og:image:width'      => '300',
            #   'og:image:height'     => '300',
            #   'charset'             => 'UTF-8'
            # }

This way, you can get most meta tags just like that:

page.meta['author']     # Returns "Joe Sample"

Please be aware that all keys are downcased, so it's 'dc.date.issued' and not 'DC.date.issued'.
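The key normalization can be pictured as a plain transform_keys over the raw tag names. This is only an illustration of the observable behaviour, not the gem's implementation, and the hash below is made up:

```ruby
raw = {
  'DC.date.issued' => '2011-09-15',
  'Keywords'       => 'one, two, three'
}

meta = raw.transform_keys(&:downcase)

meta['dc.date.issued'] # => "2011-09-15"
meta['keywords']       # => "one, two, three"
```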

Misc

page.charset             # UTF-8
page.content_type        # content-type returned by the server when the url was requested

Other representations

You can also access most of the scraped data as a hash:

page.to_hash    # { "url"   => "http://sitevalidator.com",
                #   "title" => "MarkupValidator :: site-wide markup validation tool", ... }

The original document is accessible from:

page.to_s         # A String with the contents of the HTML document

And the full scraped document is accessible from:

page.parsed  # Nokogiri document that you can use to get any element from the page

Options

Forced encoding

If you get a MetaInspector::RequestError, "invalid byte sequence in UTF-8" or similar error, you can try forcing the encoding like this:

page = MetaInspector.new(url, :encoding => 'UTF-8')

Timeout & Retries

You can specify two different timeouts when requesting a page:

  • connection_timeout sets the maximum number of seconds to wait to get a connection to the page.
  • read_timeout sets the maximum number of seconds to wait to read the page, once connected.

Both timeouts default to 20 seconds.

You can also specify the number of retries, which defaults to 3.

For example, this will time out after 10 seconds waiting for a connection, or after 5 seconds waiting to read its contents, and will retry 4 times:

page = MetaInspector.new('www.google', :connection_timeout => 10, :read_timeout => 5, :retries => 4)

If MetaInspector fails to fetch the page after it has exhausted its retries, it will raise MetaInspector::TimeoutError, which you can rescue in your application code.

begin
  page = MetaInspector.new(url)
rescue MetaInspector::TimeoutError
  enqueue_for_future_fetch_attempt(url)
  render_simple(url)
else
  render_rich(page)
end

Redirections

By default, MetaInspector will follow redirects (up to a limit of 10).

If you want to disallow redirects, you can do it like this:

page = MetaInspector.new('facebook.com', :allow_redirections => false)

You can also customize how many redirects you wish to allow:

page = MetaInspector.new('facebook.com', :faraday_options => { redirect: { limit: 5 } })

And even customize what to do in between each redirect:

callback = proc do |previous_response, next_request|
  ip_address = Resolv.getaddress(next_request.url.host)
  raise 'Invalid address' if IPAddr.new(ip_address).private?
end

page = MetaInspector.new(url, faraday_options: { redirect: { callback: callback } })

The faraday_options[:redirect] hash is passed to the FollowRedirects middleware used by Faraday, so that we can use all available options. Check them here.

Headers

By default, the following headers are set:

{
  'User-Agent'      => "MetaInspector/#{MetaInspector::VERSION} (+https://github.com/jaimeiniesta/metainspector)",
  'Accept-Encoding' => 'identity'
}

The Accept-Encoding is set to identity to avoid exceptions being raised on servers that return malformed compressed responses, as explained here.

If you want to override the default headers then use the headers option:

# Set the User-Agent header
page = MetaInspector.new('example.com', :headers => {'User-Agent' => 'My custom User-Agent'})

Disabling SSL verification (or any other Faraday options)

Faraday can be passed options via :faraday_options.

This is useful when you need to customize the way the page is requested, for example to disable SSL verification:

MetaInspector.new('https://example.com')
# Faraday::SSLError: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed

MetaInspector.new('https://example.com', faraday_options: { ssl: { verify: false } })
# Now we can access the page

Allow non-HTML content type

MetaInspector will by default raise an exception when trying to parse a non-HTML URL (one whose content-type is different from text/html). You can disable this behaviour with:

page = MetaInspector.new('sitevalidator.com', :allow_non_html_content => true)

page = MetaInspector.new('http://example.com/image.png')
page.content_type  # "image/png"
page.description   # will raise an exception

page = MetaInspector.new('http://example.com/image.png', :allow_non_html_content => true)
page.content_type  # "image/png"
page.description   # will return a garbled string

URL Normalization

By default, URLs are normalized using the Addressable gem. For example:

# Normalization will add a default scheme and a trailing slash...
page = MetaInspector.new('sitevalidator.com')
page.url # http://sitevalidator.com/

# ...and it will also convert international characters
page = MetaInspector.new('http://www.詹姆斯.com')
page.url # http://www.xn--8ws00zhy3a.com/

While this is generally useful, it can be tricky sometimes.

You can disable URL normalization by passing the normalize_url: false option.
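The default-scheme and trailing-slash parts of normalization can be sketched with the stdlib. MetaInspector actually delegates to the Addressable gem, which also handles the international-character conversion shown above; this snippet only illustrates the simpler cases:

```ruby
require 'uri'

input = 'sitevalidator.com'

# Add a default scheme if none is present, then a trailing slash
# if the path is empty.
normalized = input.start_with?('http://', 'https://') ? input : "http://#{input}"
normalized += '/' if URI.parse(normalized).path.empty?

normalized # => "http://sitevalidator.com/"
```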

Image downloading

When you ask for the largest image on the page with page.images.largest, it is determined by its width and height attributes in the HTML markup, and also by downloading a small portion of each image using the fastimage gem. This is fast because it doesn't download entire images, normally just the headers of the image files.

If you want to disable this, you can specify it like this:

page = MetaInspector.new('http://example.com', download_images: false)

Caching responses

MetaInspector can be configured to use Faraday::HttpCache to cache page responses. For that you should pass the faraday_http_cache option with at least the :store key, for example:

cache = ActiveSupport::Cache.lookup_store(:file_store, '/tmp/cache')
page = MetaInspector.new('http://example.com', faraday_http_cache: { store: cache })

Exception Handling

Web page scraping is tricky: you can expect to encounter different exceptions during the request of the page or the parsing of its contents. MetaInspector encapsulates these exceptions in the following main errors:

  • MetaInspector::TimeoutError. When fetching a web page has taken too long.
  • MetaInspector::RequestError. When there has been an error during the request phase. Examples: page not found, SSL failure, invalid URI.
  • MetaInspector::ParserError. When there has been an error parsing the contents of the page.
  • MetaInspector::NonHtmlError. When the content of the page was not HTML. See also the allow_non_html_content option.
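A sketch of rescuing each error class separately. The stub classes below are defined only so the sketch runs standalone; in a real application you'd require 'metainspector' and use the gem's classes instead (the subclass names come from the list above, while the shared Error parent is an assumption), and the fetch_outcome helper and its simulated failure are made up for illustration:

```ruby
# Stub error hierarchy, standing in for the gem's classes (illustration only).
module MetaInspector
  class Error        < StandardError; end
  class TimeoutError < Error; end
  class RequestError < Error; end
  class ParserError  < Error; end
  class NonHtmlError < Error; end
end

# Hypothetical wrapper mapping each error class to an outcome for the caller.
def fetch_outcome(url)
  raise MetaInspector::RequestError, "could not fetch #{url}" # simulated failure
rescue MetaInspector::TimeoutError
  :retry_later   # transient; enqueue for a future attempt
rescue MetaInspector::RequestError, MetaInspector::NonHtmlError
  :skip          # nothing scrapeable at this URL
rescue MetaInspector::ParserError
  :log_and_skip  # fetched, but the contents couldn't be parsed
end

fetch_outcome('http://example.com') # => :skip
```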

Examples

You can find some sample scripts in the examples folder, including a basic scraper and a spider that follows external links using a queue. What follows is an example of use from irb:

$ irb
>> require 'metainspector'
=> true

>> page = MetaInspector.new('http://sitevalidator.com')
=> #<MetaInspector:0x11330c0 @url="http://sitevalidator.com">

>> page.title
=> "MarkupValidator :: site-wide markup validation tool"

>> page.meta['description']
=> "Site-wide markup validation tool. Validate the markup of your whole site with just one click."

>> page.meta['keywords']
=> "html, markup, validation, validator, tool, w3c, development, standards, free"

>> page.links.size
=> 15

>> page.links[4]
=> "/plans-and-pricing"

Contributing guidelines

You're more than welcome to fork this project and send pull requests. Just remember to:

  • Create a topic branch for your changes.
  • Add specs.
  • Keep your fake responses as small as possible. For each change in spec/fixtures, a comment should be included explaining why it's needed.
  • Update README.md if needed (for example, when you're adding or changing a feature).

Thanks to all the contributors:

https://github.com/jaimeiniesta/metainspector/graphs/contributors

You can also come to chat with us on our Gitter room and Google group.

Related projects

License

MetaInspector is released under the MIT license.


metainspector's Issues

Test failure in master

is this expected?

  1) MetaInspector::Request exception handling should handle socket errors
     Failure/Error: logger.should receive(:<<).with(an_instance_of(SocketError))
       (#<MetaInspector::ExceptionLog:0x007fe86b591658>).<<(#<RSpec::Mocks::ArgumentMatchers::InstanceOf:0x007fe86b590690 @klass=SocketError>)
           expected: 1 time with arguments: (#<RSpec::Mocks::ArgumentMatchers::InstanceOf:0x007fe86b590690 @klass=SocketError>)
           received: 0 times with arguments: (#<RSpec::Mocks::ArgumentMatchers::InstanceOf:0x007fe86b590690 @klass=SocketError>)
     # ./spec/request_spec.rb:47:in `block (3 levels) in <top (required)>'

On safe/all redirects, the new uri should be available from Metainspector

On :safe and :all redirects, the new uri after the redirect should be available as a property of Metainspector. This is the behavior used by open-uri.

request = open("http://www.facebook.com", { :allow_redirections => :all })
request.base_uri # returns https://www.facebook.com 

@jaimeiniesta I was wondering what your take is on replacing the url, scheme, host and root_url or just saving the new uri on a different variable (i.e: redirect_url, redirect_scheme, redirect_host, etc).

Iterator for all meta tags?

Hello - is there a way I can iterate through all the meta tags on the page? Also, pull meta tags that are not necessarily OG tags or other non-standard ones - what about custom meta tags? If I had an iterator through it all that would be enough I think.

Documentation is outdated

It seems the page.document and page.parsed_document methods used in the example are now private and renamed

redirection loop on redirections that use cookies

As explained on #57 by @daviddeparolesa there is a problem following redirections that involve cookies:

➔ irb
2.0.0-p195 :001 > require 'metainspector'
 => true
2.0.0-p195 :002 > MetaInspector.new('http://6thfloor.blogs.nytimes.com/2014/01/23/how-our-hillary-clinton-cover-came-about', allow_redirections: :all)
URI::InvalidURIError: the scheme http does not accept registry part: :80 (or bad hostname?)

Reviewing external links

Currently, we have 3 methods for dealing with link scraping:

https://github.com/jaimeiniesta/metainspector/blob/master/lib/meta_inspector/parser.rb#L64-L77

links will return all links found on a page, internal_links will give you only those that are on the same host as the page, and external_links is equivalent to links minus internal_links.

What's not so clear is that, due to the current implementation, external_links will also return non-http links, that is, mailto, telnet, ftp, javascript links...

I don't like this behaviour, I think that what you'd expect for external_links would be only HTTP links. So I'm thinking about fixing this, and adding a new method for non_http_links. So it would be something like this:

m = MetaInspector.new('http://example.com')

m.internal_links # ['http://example.com', 'http://example.com/faqs']

m.external_links # ['https://github.com', 'http://twitter.com']

m.non_http_links # ['mailto:[email protected]', "javascript:alert('hey');"]

All this was motivated by this example where, to get the links to external pages, I had to filter them using a regexp.

What do you think? Would you consider this a bug fix or a change in the API?

1.9 syntax for v3?

i'm working on the various v3 features now. what do you think of dropping support for ruby < 1.9?

uninitialized constant MetaInspector::VERSION

After the latest merge I get error:
uninitialized constant MetaInspector::VERSION

Guess something's going wrong in my app while setting headers...

Also setting a custom header doesn't seem to work this out.

Strange 503 with stackoverflow

I have no idea why yet, probably getting blocked by useragent/ip. Will investigate.

Loading production environment (Rails 3.2.3)
1.9.3p125 :001 > v = "http://stackoverflow.com/questions/3573955/how-can-i-hide-keyboard-when-i-press-return-key"
 => "http://stackoverflow.com/questions/3573955/how-can-i-hide-keyboard-when-i-press-return-key"
1.9.3p125 :002 > page = MetaInspector.new(v)
 => #<MetaInspector::Scraper:0x00000004e64930 @url="http://stackoverflow.com/questions/3573955/how-can-i-hide-keyboard-when-i-press-return-key", @scheme="http", @data=#<Hashie::Rash url="http://stackoverflow.com/questions/3573955/how-can-i-hide-keyboard-when-i-press-return-key">>
1.9.3p125 :003 > page.title
An exception occurred while trying to fetch the page!
503 Service Unavailable
 => ""

Accept variable with html content

I'd like to be able to call MetaInspector.new(url.body_str) as I'm already calling curb for the information in other uses, I've not found a way of doing this.. is it possible / can it be added

Problem with fetching site title

When fetching data from this site:

http://nymag.com/thecut/2014/09/remembering-joan-riverss-iconic-style.html?om_rid=AACo6R

The title is:

'Remembering Joan Rivers’s Iconic Style -- The CutDrop down arrowClose Dialog IconSearch iconMenu iconArrowArrowFacebook iconTwitter iconGoogle+ iconPinterest iconWhatsapp iconEmail iconComment iconPrint iconClose Dialog IconOpen SlideshowFacebook iconTwitter iconGoogle+ iconPinterest iconWhatsapp iconEmail iconPrint iconClose Dialog IconClose iconEmail iconFacebook iconPinterest iconTwitter iconiPad iconInstagram iconRSS iconFeedly iconArrowArrow'

it should be:

'Remembering Joan Rivers’s Iconic Style'

Just wondering why this happened. I am using the 'og:title' as a work around currently.

internal/external links fail when they contain international characters

To reproduce this, try:

MetaInspector.new('http://en.wikipedia.org/wiki/Roman%C3%A9e-Conti').internal_links

There's trouble with URI.parse handling international characters:

URI::InvalidURIError: bad URI(is not URI?): http://en.wikipedia.org/w/index.php?title=Romanée-Conti&oldid=498422634
from /Users/jaime/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/uri/common.rb:176:in `split'
from /Users/jaime/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
from /Users/jaime/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
from /Users/jaime/.rvm/gems/ruby-1.9.3-p194/gems/metainspector-1.10.0/lib/meta_inspector/scraper.rb:48:in `block in internal_links'
from /Users/jaime/.rvm/gems/ruby-1.9.3-p194/gems/metainspector-1.10.0/lib/meta_inspector/scraper.rb:48:in `select'
from /Users/jaime/.rvm/gems/ruby-1.9.3-p194/gems/metainspector-1.10.0/lib/meta_inspector/scraper.rb:48:in `internal_links'

Question: Should an exception be added when a 404 is found?

In version 1.15.3, pages that returned a 404's raised/stored an exception (and ok? returned false when warn_level was set to :store) as expected. (I have a test in my application that confirms this.)

In more recent versions, it seems like a 404 no longer raises the expected exception. (Updating to 3.x broke the behavior I was testing.)

So before I dive in deeper to understand why I'm not seeing an exception as expected, should a scraped 404 raise an exception in newer versions?

To be clear, I'll do the legwork on this if a maintainer can confirm or correct my expectations.

Sorry to ask this here but I'm unaware of an IRC channel or google group or anything.

timeouts and retries

in my app, sometimes metainspector gets a timeout when accessing a url.

i would like to build in support for timeouts and retries. for example, set the connection+response timeout to 5 seconds, and do 3 retries (4 total tries) before giving up. (i'll do research on what best/common practices are for those numbers)

an approach to doing this is described here: http://stackoverflow.com/questions/5680100

would you be interested in such a feature for 3.0?

Redirection fatal for http -> https

MetaInspector currently fails if there is an http to https redirection, due to an issue with open-uri. http://redmine.ruby-lang.org/issues/3719

An example to reproduce
ruby-1.9.2-p136 :001 > page = MetaInspector.new('http://wepay.com')
=> #<MetaInspector::Scraper:0x000001043c00c8 @address="http://wepay.com", @links=nil, @image=nil, @keywords=nil, @description=nil, @title=nil, @document=nil>
ruby-1.9.2-p136 :002 > page.title
An exception occurred while trying to fetch the page!
ruby-1.9.2-p136 :003 > open(page.address).read
RuntimeError: redirection forbidden: http://wepay.com -> https://www.wepay.com/
from /Users/rromanchuk/.rvm/rubies/ruby-1.9.2-p136/lib/ruby/1.9.1/open-uri.rb:216:in `open_loop'

I was able to fix this by patching the regular expression in def OpenURI.redirectable? to include https. It's a pretty annoying/subtle bug so I thought it was worth reporting.

module OpenURI
  def OpenURI.redirectable?(uri1, uri2) # :nodoc:
    # This test is intended to forbid a redirection from http://... to
    # file:///etc/passwd.
    # https to http redirect is also forbidden intentionally.
    # It avoids sending secure cookie or referer by non-secure HTTP protocol.
    # (RFC 2109 4.3.1, RFC 2965 3.3, RFC 2616 15.1.3)
    # However this is ad hoc. It should be extensible/configurable.
    uri1.scheme.downcase == uri2.scheme.downcase ||
      (/\A(?:http|https|ftp)\z/i =~ uri1.scheme && /\A(?:http|https|ftp)\z/i =~ uri2.scheme)
  end
end

uninitialized constant Faraday::ConnectionFailed

$ irb
2.1.2 :001 > require 'metainspector'
 => true
2.1.2 :002 > MetaInspector::VERSION
 => "3.1.0"
2.1.2 :003 > MetaInspector.new('http://www.google.com')
NameError: uninitialized constant Faraday::ConnectionFailed
    from /Users/jaime/.rvm/gems/ruby-2.1.2@w3cloveapp/gems/metainspector-3.1.0/lib/meta_inspector/request.rb:46:in `rescue in response'
    from /Users/jaime/.rvm/gems/ruby-2.1.2@w3cloveapp/gems/metainspector-3.1.0/lib/meta_inspector/request.rb:40:in `response'
    from /Users/jaime/.rvm/gems/ruby-2.1.2@w3cloveapp/gems/metainspector-3.1.0/lib/meta_inspector/request.rb:23:in `initialize'
    from /Users/jaime/.rvm/gems/ruby-2.1.2@w3cloveapp/gems/metainspector-3.1.0/lib/meta_inspector/document.rb:28:in `new'
    from /Users/jaime/.rvm/gems/ruby-2.1.2@w3cloveapp/gems/metainspector-3.1.0/lib/meta_inspector/document.rb:28:in `initialize'
    from /Users/jaime/.rvm/gems/ruby-2.1.2@w3cloveapp/gems/metainspector-3.1.0/lib/meta_inspector.rb:17:in `new'
    from /Users/jaime/.rvm/gems/ruby-2.1.2@w3cloveapp/gems/metainspector-3.1.0/lib/meta_inspector.rb:17:in `new'
    from (irb):3
    from /Users/jaime/.rvm/rubies/ruby-2.1.2/bin/irb:11:in `<main>'

Stop using metaprogramming to get meta tags

Currently we're finding meta tags using metaprogramming, like:

page.meta_robots
page.meta_og_video_width

We've recently found a limitation on this model on #55, for the case when the meta tag itself uses underscores, like in og:site_name or og:audio:secure_url. To solve this I've applied a temporary patch for this case, but I'm not satisfied with the results.

I'm going to explore a simpler way, which could be something like:

page.meta['robots']
page.meta['og:video:width']
page.meta['og:site_name']

That is, a simple meta method that will return a hash with all the meta found on the document, which can be looked up by its key.

We also need to deal with meta tags that can have several values, like in this example:

<meta property="og:locale" content="en_GB" />
<meta property="og:locale:alternate" content="fr_FR" />
<meta property="og:locale:alternate" content="es_ES" />

A possible solution is having the meta method return a hash where the values are always an array, like:

{
  "og:locale" => ["en_GB"],
  "og:locale:alternate" => ["fr_FR", "es_ES"]
}

timeout option giving an error

everytime I add an option :timeout when instantiating MetaInspector, i get this following error:

Scraping exception: undefined method `zero?' for {:timeout=>30}:Hash
Parsing exception: undefined method `match' for #<Array:0x007ff1b7fc95f8>
NoMethodError: undefined method `xpath' for #<Array:0x007ff1b7fc95f8>

what seems to be the problem?

Opinion wanted: store exceptions, or raise them?

Hey everyone, I wanted to know your opinion on this.

Since #16 we've been storing the exceptions in the errors array and the way to check if everything was right after a query is to check the ok? method.

I've been thinking about this and I'm not really sure if this is still a good idea. I mean, maybe it's simpler and more effective to let the exceptions raise when they occur (as @rromanchuk suggested when he sent this PR).

If you had to deal with it on a new gem, how would you treat exceptions? Like timeouts, network errors, nokogiri parsing errors...? Raise them as they come? Raise our own Metainspector errors? Store them like we do now?

What do you think @andyt @contentdj @ffmike @gabceb @gautamrekolan @ikataitsev @iteh @limonka @macbury @natarius @netconstructor @oriolgual @rromanchuk @settinghead ?

Thanks!!!

Crash on malformed mailto links

MetaInspector crashes on this scenario:

p = MetaInspector.new("http://www.stressfaktor.squat.net/adressen")
p.internal_links

It seems that there is a malformed mailto:

URI::InvalidComponentError: unrecognised opaque part for mailtoURL: systemfehler-berlin(at)web.de

smarter image picking

i'd like to be able to do smarter image picking somehow. we could…

  • use og metadata for width and height to present to the user the og images in decreasing size (few sites will have both multiple images and the size data available though)
  • download all images, determine their size, then present them in an array of decreasing size
  • use nokogiri to somehow intelligently guess which images are from the content area of the page (there must be existing recipes for this out there)

thoughts?

this is maybe a 3.1 feature.

Problem with fetching metadata of some websites

Here is an example:

MetaInspector.new("http://www.highsnobiety.com", html_content_only: true, verbose: true).to_hash
 => {"url"=>"http://www.highsnobiety.com/", "title"=>nil, "links"=>[], "internal_links"=>[], "external_links"=>[], "images"=>[], "charset"=>nil, "feed"=>nil, "content_type"=>"text/html", "meta"=>{"name"=>{}, "property"=>{}}} 

Do you have any idea why this is happening?

undefined method `value' for nil:NilClass

Hi,

I'm trying to use metainspector to grab metadata in a Rails app. I consistently get the following when calling page.image or page.description. It works, however, in the Rails console. Below is the trace.

metainspector (1.9.0) lib/meta_inspector/scraper.rb:127:in `block in method_missing'
nokogiri (1.5.3) lib/nokogiri/xml/node_set.rb:239:in `block in each'
nokogiri (1.5.3) lib/nokogiri/xml/node_set.rb:238:in `upto'
nokogiri (1.5.3) lib/nokogiri/xml/node_set.rb:238:in `each'
metainspector (1.9.0) lib/meta_inspector/scraper.rb:126:in `method_missing'
metainspector (1.9.0) lib/meta_inspector/scraper.rb:29:in `description'
app/models/twitter_job.rb:38:in `find_or_create'
app/models/twitter_job.rb:15:in `block in stories_pull'
app/models/twitter_job.rb:13:in `each'
app/models/twitter_job.rb:13:in `stories_pull'
app/models/twitter_job.rb:5:in `pull'
app/controllers/stories_controller.rb:14:in `fetch_stories'
app/controllers/stories_controller.rb:7:in `pull'
actionpack (3.2.2) lib/action_controller/metal/implicit_render.rb:4:in `send_action'
actionpack (3.2.2) lib/abstract_controller/base.rb:167:in `process_action'
actionpack (3.2.2) lib/action_controller/metal/rendering.rb:10:in `process_action'
actionpack (3.2.2) lib/abstract_controller/callbacks.rb:18:in `block in process_action'
activesupport (3.2.2) lib/active_support/callbacks.rb:425:in `_run__2007603356777030170__process_action__3443411133637203251__callbacks'
activesupport (3.2.2) lib/active_support/callbacks.rb:405:in `__run_callback'
activesupport (3.2.2) lib/active_support/callbacks.rb:385:in `_run_process_action_callbacks'
activesupport (3.2.2) lib/active_support/callbacks.rb:81:in `run_callbacks'
actionpack (3.2.2) lib/abstract_controller/callbacks.rb:17:in `process_action'
actionpack (3.2.2) lib/action_controller/metal/rescue.rb:29:in `process_action'
actionpack (3.2.2) lib/action_controller/metal/instrumentation.rb:30:in `block in process_action'
activesupport (3.2.2) lib/active_support/notifications.rb:123:in `block in instrument'
activesupport (3.2.2) lib/active_support/notifications/instrumenter.rb:20:in `instrument'
activesupport (3.2.2) lib/active_support/notifications.rb:123:in `instrument'
actionpack (3.2.2) lib/action_controller/metal/instrumentation.rb:29:in `process_action'
actionpack (3.2.2) lib/action_controller/metal/params_wrapper.rb:205:in `process_action'
activerecord (3.2.2) lib/active_record/railties/controller_runtime.rb:18:in `process_action'

3.0 release

@jb I think we can start thinking about releasing 3.0; is there anything you want to add before that?

If everything is ready then I'll do a general review of the latest changes, update the README, and release it.

metainspector following redirects when given :document

The scenario is: I'm already running a script to get the content of the page and I just want metainspector to analyze that, so I'm using the :document functionality

@meta = MetaInspector.new(site_url, document: page_body)

However, MetaInspector seems to ignore the document and still fetch the url (and follow a redirect).

for example

page_body = '<html></html>'
site_url = 'http://cmbinfo.com' # a url which has a 301 redirect
meta = MetaInspector.new(site_url, document: page_body)

meta.meta_keywords
=> "boston, brand optimization, product development research, market research, marketing, branding research, customer experience research, service development research, social media research, marketing communications research, segmentation research, custom market research, boston market research, john martin, cmbinfo, cmb research"

In my use case, meta.meta_keywords should be nil.
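The behavior I'd expect could be modeled like this (a minimal hypothetical class, not MetaInspector's internals): when a document is passed in, the network should never be touched, so no redirect can ever be followed:

```ruby
# Minimal model of the expected :document semantics (hypothetical code,
# not MetaInspector's implementation).
class DocScraper
  def initialize(url, document: nil)
    @url = url
    @document = document
  end

  # Only hit the network when no document was supplied up front.
  def document
    @document ||= fetch(@url)
  end

  private

  def fetch(url)
    raise "unexpected network request to #{url}"
  end
end

DocScraper.new('http://cmbinfo.com', document: '<html></html>').document
# "<html></html>" — no request, so no 301 to follow
```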

Feature Request

Hi, metainspector is just awesome. Is there a way to extract favicon.ico? The Pismo gem has this feature, but I use metainspector for everything, so it would be great if metainspector had it too. Thanks.
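In the meantime, something like this naive stdlib-only sketch (not MetaInspector code) can pull a favicon out of the head links, falling back to the conventional /favicon.ico location:

```ruby
require 'uri'

# Very naive favicon extraction: find a <link rel="icon"> (or
# rel="shortcut icon") tag with a regex, resolve its href against the
# page's root url, and fall back to /favicon.ico when none is declared.
def favicon(html, root_url)
  tag = html[/<link[^>]+rel=["'](?:shortcut )?icon["'][^>]*>/i]
  if tag && (href = tag[/href=["']([^"']+)["']/i, 1])
    return URI.join(root_url, href).to_s
  end
  URI.join(root_url, '/favicon.ico').to_s
end

favicon('<link rel="icon" href="/fav.png">', 'http://example.com/')
# "http://example.com/fav.png"
```

A real implementation would query the parsed Nokogiri document instead of regexing raw HTML, but the resolution logic would be the same.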

to_hash not working

On metainspector 1.9.0, to_hash is failing:

$ irb
1.9.2p290 :001 > require 'metainspector'
 => true 
1.9.2p290 :002 > m = MetaInspector.new('http://w3clove.com')
 => #<MetaInspector::Scraper:0x007fa519302d28 @url="http://w3clove.com", @scheme="http", @data=#<Hashie::Rash   url="http://w3clove.com">> 
1.9.2p290 :003 > m.to_hash
NoMethodError: undefined method `downcase' for nil:NilClass
from /Users/jaime/.rvm/gems/ruby-1.9.2-p290@metainspector/gems/metainspector-1.9.0/lib/meta_inspector/scraper.rb:74:in `charset'
from /Users/jaime/.rvm/gems/ruby-1.9.2-p290@metainspector/gems/metainspector-1.9.0/lib/meta_inspector/scraper.rb:80:in `to_hash'
from (irb):3
from /Users/jaime/.rvm/rubies/ruby-1.9.2-p290/bin/irb:16:in `<main>'
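The crash comes from calling downcase on a nil charset at scraper.rb:74; a nil guard along these lines would avoid it (a sketch only, the method name here is hypothetical):

```ruby
# Nil-safe version of the downcasing that blows up in charset:
# return nil when no charset was detected instead of raising
# NoMethodError on nil.
def normalized_charset(detected)
  detected && detected.downcase
end

normalized_charset('UTF-8') # "utf-8"
normalized_charset(nil)     # nil, so to_hash can proceed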

Newlines in titles

This might be controversial, but it would be nice to strip any newlines inside Scraper#title:

<title> Carol Bartz exclusive: Yahoo "f---ed me over" - Postcards </title>

page.title
=> "\n\t\t Carol Bartz exclusive: Yahoo \"f---ed me over\" - \n\t\tPostcards\t"

Ran into the above at http://postcards.blogs.fortune.cnn.com/2011/09/08/carol-bartz-fired-yahoo/. We could probably do the caller a favor by cleaning up extra whitespace, which is usually not expected in a title.

LMK and I'll check it in… and add the rest of the missing tests.
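A possible normalization (a sketch, not a shipped implementation): collapse every run of whitespace, newlines and tabs included, into a single space and trim the ends:

```ruby
# Collapse internal whitespace runs to one space and strip the edges.
def clean_title(raw)
  raw.gsub(/\s+/, ' ').strip
end

clean_title("\n\t\t Carol Bartz exclusive: Yahoo \"f---ed me over\" - \n\t\tPostcards\t")
# "Carol Bartz exclusive: Yahoo \"f---ed me over\" - Postcards"
```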
