journal-scrapers

Journal scraper definitions for the ContentMine framework.

Summary

This repo is a collection of scraperJSON definitions targeting academic journals. Given the URL of a journal article, they can be used to extract and download data such as:

  • Title, author list, date
  • Figures and their captions
  • Fulltext PDF, HTML, XML, RDF
  • Supplementary materials
  • Reference lists

Scraper collection status

All the scrapers in the collection are tested automatically every day, and again every time any scraper is changed. The tests work from expected results stored for a set of URLs: one of those URLs is selected at random and re-scraped, and if the results match the expected ones the test passes.

If the badge is green and says build|passing, all the scrapers are OK. If the badge is red and says build|failing, one or more of the scrapers have stopped working. You can click on the badge to open the test report and see which scrapers are failing and how.

Build Status
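The test procedure described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual test harness: `run_regression_test`, `fake_scrape`, the example URLs, and the shape of the stored results are all hypothetical stand-ins.

```python
import random


def run_regression_test(expected_by_url, scrape, rng=random):
    """Re-scrape one randomly chosen URL and compare with the stored results.

    expected_by_url: dict mapping a test URL to its stored expected results.
    scrape: callable taking a URL and returning freshly scraped results
            (hypothetical; stands in for whatever actually runs the scraper).
    """
    url = rng.choice(sorted(expected_by_url))
    actual = scrape(url)
    return url, actual == expected_by_url[url]


# Stand-in data: stored expected results for two made-up test URLs.
expected = {
    "https://example.org/article/1": {"title": ["First article"]},
    "https://example.org/article/2": {"title": ["Second article"]},
}


def fake_scrape(url):
    # A real run would fetch the page and apply the scraper definition.
    return dict(expected[url])


url, passed = run_regression_test(expected, fake_scrape, random.Random(0))
print(url, "passed" if passed else "failed")
```

Because only one URL is sampled per run, a broken scraper may take more than one run to show up if it still works on some of its test URLs.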

How well the scrapers are covered by the tests is also checked. Coverage should be 100%, which means every element of every scraper is checked at least once during testing. If coverage is below 100%, you can see exactly which parts of which scrapers are not covered by clicking the coverage badge below.

Coverage
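As a rough illustration of what "every element checked at least once" means, the sketch below lists the elements of a definition that no stored expected result exercises. How the project's tooling really computes coverage is not described here; the `elements` key, the element names, and the result layout are assumptions made for the example.

```python
def uncovered_elements(definition, expected_results):
    """Return the element names that never have a non-empty value in any
    stored expected result (hypothetical layout: one dict per test URL)."""
    covered = set()
    for result in expected_results:
        covered.update(name for name, values in result.items() if values)
    return sorted(set(definition["elements"]) - covered)


# Made-up definition with three elements and two stored results.
definition = {"elements": {"title": {}, "authors": {}, "fulltext_pdf": {}}}
expected_results = [
    {"title": ["An article"], "authors": ["A. Author"]},
    {"title": ["Another article"], "fulltext_pdf": []},
]

missing = uncovered_elements(definition, expected_results)
total = len(definition["elements"])
coverage = 100 * (total - len(missing)) / total
print(missing, f"{coverage:.0f}%")
```

Here `fulltext_pdf` is never captured with a value, so it is reported as uncovered and coverage falls below 100%.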

ScraperJSON definitions

Scrapers are defined in JSON, using a schema called scraperJSON, which is still evolving. The current schema is described in the scraperJSON repo.
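To give a feel for the shape of a definition, here is a minimal sketch built as a Python dict and serialised to JSON. The field names (`url`, `elements`, `selector`, `attribute`, `download`) reflect one reading of the scraperJSON schema and should be checked against the scraperJSON repo, since the schema is evolving; the URL pattern, element names, and `applies_to` helper are invented for illustration.

```python
import json
import re

# Hypothetical definition for a made-up journal site.
definition = {
    # Regular expression deciding which article URLs this scraper applies to.
    "url": r"example\.org/article/\d+",
    "elements": {
        "title": {
            "selector": "//meta[@name='citation_title']",  # XPath
            "attribute": "content",
        },
        "fulltext_pdf": {
            "selector": "//meta[@name='citation_pdf_url']",
            "attribute": "content",
            "download": True,  # assumed flag: fetch the linked file
        },
    },
}


def applies_to(definition, url):
    """Return True if the definition's URL pattern matches the given URL."""
    return re.search(definition["url"], url) is not None


print(json.dumps(definition, indent=2))
print(applies_to(definition, "https://example.org/article/42"))
```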

Contributing scrapers

If your favourite publisher or journal is not covered by a scraper in our collection, we'd love you to submit a new scraper.

We ask that all contributions follow some simple rules that help us maintain a high-quality collection.

  1. The scraper must cover all the data elements used in the ContentMine.
  2. You must submit a set of 5-10 test URLs.
  3. The scraper must come with a regression test (which can be auto-generated).
  4. You must agree to release the scraper definition and tests under the Creative Commons Zero (CC0) license.

Usage

Currently these definitions can be used with the quickscrape tool.
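A minimal sketch of driving quickscrape from a script is shown below. It only builds the command line. The `--url`, `--scraper`, and `--output` options are assumed from quickscrape's documentation and should be verified against the installed version; the article URL and file paths are placeholders.

```python
import shlex


def quickscrape_command(url, scraper_path, output_dir):
    """Build an argument list for scraping one URL with one definition.

    Option names are assumptions; check them with the tool's own help output.
    """
    return [
        "quickscrape",
        "--url", url,
        "--scraper", scraper_path,
        "--output", output_dir,
    ]


cmd = quickscrape_command(
    "https://example.org/article/1",  # placeholder article URL
    "scrapers/example.json",          # placeholder definition path
    "output",                         # placeholder output directory
)
print(shlex.join(cmd))
```

With quickscrape installed, the list can be passed to `subprocess.run(cmd, check=True)` to run the scrape.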

License

All scrapers are released under the Creative Commons Zero (CC0) license.
