unicontent

Unicontent is a Python library to extract metadata from different types of sources and for different types of objects. The goal is to normalize metadata and to provide an easy-to-use extractor. Given an identifier (URL, DOI, ISBN), unicontent can retrieve structured data about the corresponding object.

Usage

This is the basic usage for extracting metadata from any kind of identifier: unicontent detects the type of identifier and uses the matching extractor. Use the get_metadata function if you only need the metadata.

from unicontent.extractors import get_metadata
data = get_metadata(identifier="http://example.com", format='n3')
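How the identifier type might be detected can be sketched with the standard library alone. This is only an illustration under stated assumptions: detect_identifier_type, is_isbn and DOI_PATTERN are hypothetical names, and the actual detection rules used by unicontent may differ.

```python
import re

# Hypothetical helper names; none of these are part of unicontent's documented API.
# A DOI starts with "10.", a registrant code, a slash and a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def is_isbn(value):
    """Return True if value looks like an ISBN-10 or ISBN-13 (checksum not verified)."""
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) == 13:
        return digits.isdigit()
    if len(digits) == 10:
        # The last character of an ISBN-10 may be the check character X.
        return digits[:9].isdigit() and (digits[9].isdigit() or digits[9] in "xX")
    return False


def detect_identifier_type(identifier):
    """Classify an identifier as 'url', 'doi' or 'isbn'."""
    identifier = identifier.strip()
    if identifier.lower().startswith(("http://", "https://")):
        return "url"
    if DOI_PATTERN.match(identifier):
        return "doi"
    if is_isbn(identifier):
        return "isbn"
    raise ValueError("Unrecognized identifier: %r" % identifier)
```

A dispatcher built on such a function would then choose the URL, DOI or ISBN extractor based on the returned string.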

See below if you want to use the extractor for a specific kind of identifier (URL, DOI or ISBN).

Extraction from URL

The URLContentExtractor class extracts data from a URL. Several output formats are available: the RDF formats ('n3', 'turtle', 'xml') return an rdflib graph, while 'dict' and 'json' return a dictionary and a JSON file, respectively, built according to the defined mapping. A default mapping is provided.

# Assumption: URLContentExtractor is exposed by unicontent.extractors, like get_metadata
from unicontent.extractors import URLContentExtractor

url = 'http://www.lemonde.fr/big-browser/article/2017/02/13/comment-les-americains-s-informent-oublient-et-reagissent-sur-les-reseaux-sociaux_5079137_4832693.html'
# 'dict' is the default format
url_extractor = URLContentExtractor(identifier=url, format='dict', schema_names=['opengraph', 'dublincore', 'htmltags'])
metadata_dict = url_extractor.get_data()

The order of the entries in schema_names defines the order in which the extractor looks for metadata. Always include htmltags so that at least the <title> tag of the web page is retrieved.
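The effect of the schema order can be illustrated with a small standard-library sketch. It is not unicontent's implementation: MetaTagParser and extract_field are hypothetical names, and the tag conventions (og:* properties for Open Graph, DC.* names for Dublin Core) are assumptions about what each schema looks for.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect candidate metadata per schema (hypothetical helper, for illustration)."""

    def __init__(self):
        super().__init__()
        self.found = {"opengraph": {}, "dublincore": {}, "htmltags": {}}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name") or ""
            content = attrs.get("content")
            if content is None:
                return
            if key.startswith("og:"):
                self.found["opengraph"][key[3:]] = content
            elif key.lower().startswith("dc."):
                self.found["dublincore"][key[3:].lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            previous = self.found["htmltags"].get("title", "")
            self.found["htmltags"]["title"] = previous + data


def extract_field(html, field, schema_names):
    """Return the first non-empty value of field, trying schemas in the given order."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    for schema in schema_names:
        value = parser.found.get(schema, {}).get(field)
        if value and value.strip():
            return value.strip()
    return None
```

With this sketch, a page that only has a <title> tag still yields a title as long as htmltags is the last entry in the list, which is the reason for always including it.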

Extraction from DOI

The module uses the DOI system Proxy Server to extract metadata from DOI codes. The extractor name is DOIContentExtractor.

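# Assumption: DOIContentExtractor is exposed by unicontent.extractors, like get_metadata
from unicontent.extractors import DOIContentExtractor

doi = '10.1038/nphys1170'
doi_extractor = DOIContentExtractor(identifier=doi, format='dict')
metadata_dict = doi_extractor.get_data()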
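For readers curious about what a request to the DOI proxy can look like, here is a minimal sketch based on DOI content negotiation with the standard library. The function names and the choice of the CSL JSON media type are assumptions made for illustration; the source does not say which media type or HTTP client unicontent uses.

```python
import json
import urllib.request
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def build_doi_request(doi, media_type="application/vnd.citationstyles.csl+json"):
    """Build a request asking the DOI proxy for metadata instead of the landing page."""
    # The Accept header drives content negotiation; the media type is an assumption.
    return urllib.request.Request(
        DOI_RESOLVER + quote(doi, safe="/"),
        headers={"Accept": media_type},
    )


def fetch_doi_metadata(doi):
    """Send the request and decode the JSON response (requires network access)."""
    with urllib.request.urlopen(build_doi_request(doi), timeout=10) as response:
        return json.load(response)
```

Calling fetch_doi_metadata('10.1038/nphys1170') performs a network request; build_doi_request alone can be inspected offline.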

Extraction from ISBN

To retrieve metadata from books, the library uses GoogleBooks and OpenLibrary (in this order). The extractor class is called ISBNContentExtractor. If GoogleBooks does not find the volume corresponding to the ISBN code, a request is sent to OpenLibrary to fetch the data.
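The fallback order described above can be sketched as follows. This is an illustration, not the library's code: lookup_isbn, fetch_google_books and fetch_open_library are hypothetical names, and the two public endpoints shown are an assumption about which APIs are queried.

```python
import json
import urllib.request
from urllib.parse import urlencode


def _get_json(url):
    """Fetch a URL and decode its JSON body (requires network access)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def fetch_google_books(isbn):
    """Return the first matching volume's info from Google Books, or None."""
    url = "https://www.googleapis.com/books/v1/volumes?" + urlencode({"q": "isbn:" + isbn})
    items = _get_json(url).get("items") or []
    return items[0].get("volumeInfo") if items else None


def fetch_open_library(isbn):
    """Return the Open Library record for the ISBN, or None."""
    key = "ISBN:" + isbn
    query = urlencode({"bibkeys": key, "format": "json", "jscmd": "data"})
    return _get_json("https://openlibrary.org/api/books?" + query).get(key)


DEFAULT_SOURCES = (
    ("googlebooks", fetch_google_books),
    ("openlibrary", fetch_open_library),
)


def lookup_isbn(isbn, sources=DEFAULT_SOURCES):
    """Try each source in order and return (source_name, record) for the first hit."""
    for name, fetch in sources:
        record = fetch(isbn)
        if record:
            return name, record
    return None, None
```

Passing the sources as a parameter keeps the ordering explicit and makes it possible to substitute stub functions when no network is available.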
