metascraper

A library to easily scrape metadata from an article on the web using Open Graph, JSON-LD, regular HTML metadata, and a series of fallbacks.

Getting Started

metascraper is a library to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.

It follows a few principles:

  • Have a high accuracy for online articles by default.
  • Make it simple to add new rules or override existing ones.
  • Don't restrict rules to CSS selectors or text accessors.

Installation

$ npm install metascraper --save

Usage

Let's extract accurate information from an online article.

Call metascraper with the rules bundles you want to apply for extracting content:

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

const got = require('got')

const targetUrl = 'http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance'

;(async () => {
  const { body: html, url } = await got(targetUrl)
  const metadata = await metascraper({ html, url })
  console.log(metadata)
})()

The output will be something like:

{
  "author": "Ellen Huet",
  "date": "2016-05-24T18:00:03.894Z",
  "description": "The HR startups go to war.",
  "image": "https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ioh_yWEn8gHo/v1/-1x-1.jpg",
  "publisher": "Bloomberg.com",
  "title": "As Zenefits Stumbles, Gusto Goes Head-On by Selling Insurance",
  "url": "http://www.bloomberg.com/news/articles/2016-05-24/as-zenefits-stumbles-gusto-goes-head-on-by-selling-insurance"
}

Metadata

?> Other metadata can be defined using a custom rule bundle.

Here is an example of the metadata that metascraper can collect:

How It Works

metascraper is built out of rules bundles.

It was designed to be easy to adapt. You can compose your own transformation pipeline using existing rules or write your own.

A rules bundle is a collection of HTML selectors targeting a specific property. When you load the library, it implicitly loads the core rules.

Each set of rules loads a set of selectors in order to obtain a specific value.

These rules are sorted by priority: the first rule that resolves the value successfully stops the remaining rules for that property from running. Rules are intentionally sorted from most specific to most generic.

Rules work as fallbacks for one another:

  • If the first rule fails, metascraper falls back to the second rule.
  • If the second rule fails, it falls back to the third rule.
  • And so on.

metascraper does this until it finds the first rule that resolves the value or runs out of rules.
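
The fallback behaviour described above can be sketched in a few lines. This is a simplified model under stated assumptions, not metascraper's actual implementation: `resolveFirst`, `isResolved`, the rule functions, and the sample `page` object are hypothetical names chosen for illustration, and the rules here are synchronous for brevity.

```javascript
// Simplified model of rule fallback; NOT metascraper's real internals.
// A rule is a function that returns a value, or undefined/null/'' when it fails.
const isResolved = value => value !== undefined && value !== null && value !== ''

// Try each rule in order and return the first value that resolves.
function resolveFirst (rules, input) {
  for (const rule of rules) {
    const value = rule(input)
    if (isResolved(value)) return value // the remaining rules never run
  }
  return null // every rule failed
}

// Hypothetical rules for a "title" property, sorted from specific to generic.
const titleRules = [
  page => page.openGraphTitle,
  page => page.twitterTitle,
  page => page.titleTag
]

// Hypothetical page data: there is no Open Graph title, so the second rule wins.
const page = { twitterTitle: 'Hello from Twitter', titleTag: 'Hello | Example' }
const title = resolveFirst(titleRules, page)
console.log(title)
```

Because the list is ordered, moving a rule earlier gives it priority over everything after it.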

Importing Rules

metascraper exports a constructor that needs to be initialized with the collection of rules to load:

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

Again, the order in which rules are loaded is important: only the first rule that resolves the value will be applied.

Use the first parameter of each rules bundle to pass options specific to that bundle:

const metascraper = require('metascraper')([
  require('metascraper-clearbit')({
    size: 256,
    format: 'jpg'
  })
])

Rules Bundles

?> Can't find the rules bundle that you want? Let's open an issue to create it.

Package

  • metascraper-amazon
  • metascraper-audio
  • metascraper-author
  • metascraper-clearbit
  • metascraper-date
  • metascraper-description
  • @metascraper/helpers
  • metascraper-image
  • metascraper-lang
  • metascraper-logo
  • metascraper-logo-favicon
  • metascraper-media-provider
  • metascraper-publisher
  • metascraper-readability
  • metascraper-soundcloud
  • metascraper-title
  • metascraper-uol
  • metascraper-url
  • metascraper-video
  • metascraper-youtube

Write Your Own Rules

See CONTRIBUTING.

API

constructor(rules)

Create a new metascraper instance, explicitly declaring the rules bundles to be used.

rules

Type: Array

The collection of rules bundles to be loaded.

metascraper(options)

Call the instance to extract content based on the rules bundles provided to the constructor.

options

url

Required
Type: String

The URL associated with the HTML markup.

It is used to resolve relative links that may be present in the HTML markup.

It can also be used as a fallback field by different rules.
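
To illustrate why the URL matters for relative links, here is a minimal sketch using the standard WHATWG `URL` class that ships with Node.js. `resolveLink` and the `example.com` addresses are hypothetical and only for illustration; this is not necessarily how metascraper resolves links internally.

```javascript
// resolveLink is a hypothetical helper, not part of metascraper.
// It resolves a link found in the markup against the page URL.
function resolveLink (link, baseUrl) {
  try {
    return new URL(link, baseUrl).href
  } catch (err) {
    return null // the link could not be parsed against the base URL
  }
}

const baseUrl = 'https://example.com/news/article.html'

// Root-relative path: resolved against the origin.
const relative = resolveLink('/images/cover.jpg', baseUrl)
// Path-relative link: resolved against the page's directory.
const sibling = resolveLink('cover.jpg', baseUrl)
// Absolute link: returned unchanged.
const absolute = resolveLink('https://cdn.example.com/cover.jpg', baseUrl)

console.log({ relative, sibling, absolute })
```

Without the page URL, a value such as `/images/cover.jpg` could not be turned into a usable address.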

html

Type: String

The HTML markup for extracting the content.

rules

Type: Array

You can pass additional rules to add at execution time.

These rules are merged with your loaded rules and placed at the beginning.
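
One way to picture this merge is the sketch below. It assumes, purely for illustration, that a rules bundle is an object mapping a property name to an ordered list of rule functions, and it reads "at the beginning" as "tried before the loaded rules". `mergeRules` and the rule names are hypothetical; metascraper's real merge logic may differ.

```javascript
// Hypothetical model: a bundle maps a property name to an ordered list of rules.
// mergeRules is NOT a metascraper function; it only illustrates the ordering.
function mergeRules (runtimeBundles, loadedBundles) {
  const merged = {}
  // Runtime bundles go first, so their rules are tried before the loaded ones.
  for (const bundle of [...runtimeBundles, ...loadedBundles]) {
    for (const [prop, rules] of Object.entries(bundle)) {
      merged[prop] = (merged[prop] || []).concat(rules)
    }
  }
  return merged
}

// Placeholder rules; only their names matter for this illustration.
const loaded = [{ title: [function fromTitleTag () {}] }]
const runtime = [{ title: [function fromCustomSelector () {}] }]

const merged = mergeRules(runtime, loaded)
const order = merged.title.map(rule => rule.name)
console.log(order)
```

Under this reading, a rule passed at execution time takes priority over a loaded rule for the same property.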

Benchmark

To give you an idea of how accurate metascraper is, here is a comparison of similar libraries:

Library     metascraper  html-metadata  node-metainspector  open-graph-scraper  unfluff
Correct     95.54%       74.56%         61.16%              66.52%              70.90%
Incorrect   1.79%        1.79%          0.89%               6.70%               10.27%
Missed      2.68%        23.67%         37.95%              26.34%              8.95%

A big part of the reason for metascraper's higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly-used, spec-compliant pieces of metadata, like Open Graph.

metascraper's default settings are targeted specifically at parsing online articles, which is why it can be more highly tuned than the other libraries for that purpose.

If you're interested in the breakdown by individual pieces of metadata, check out the full comparison summary, or dive into the raw result data for each library.

License

metascraper © Ian Storm Taylor, Released under the MIT License.
Maintained by Kiko Beats with help from contributors.
