Gleaner

About

Gleaner is a tool for indexing structured data on the web, developed as part of NSF EarthCube. Its focus is on collecting JSON-LD encoded data graphs that describe data resources and services. Given a list of providers, Gleaner can then process these graphs and generate a semantic network.

Basic Gleaner

A set of cloud-based tools and functions can be found at https://fence.gleaner.io/. These can be used online via the browser or through command-line calls. They are also available for use in Jupyter notebooks to develop workflows.

More

The Summoner uses sitemap files to access and parse facility resource pages. It places the results of these calls into S3 API compliant storage.
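The Summoner step described above can be sketched roughly as follows. This is a minimal illustration, not Gleaner's actual code: the function names `parseSitemap` and `extractJSONLD` are hypothetical, the regular expression is a simplification of real HTML parsing, the `example.org` URLs are placeholders, a fixed page body stands in for an HTTP fetch, and an in-memory map stands in for the S3 API compliant store.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
	"regexp"
	"strings"
)

// urlSet mirrors only the parts of a sitemap document needed here.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// parseSitemap returns the page URLs listed in a sitemap document.
func parseSitemap(data []byte) ([]string, error) {
	var s urlSet
	if err := xml.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(s.URLs))
	for _, u := range s.URLs {
		locs = append(locs, strings.TrimSpace(u.Loc))
	}
	return locs, nil
}

// jsonldRe matches <script type="application/ld+json"> blocks.
// Assumption: a regex is good enough for a sketch; a real crawler
// would use an HTML parser.
var jsonldRe = regexp.MustCompile(`(?is)<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>`)

// extractJSONLD returns every embedded JSON-LD block that is valid JSON.
func extractJSONLD(html string) []string {
	var docs []string
	for _, m := range jsonldRe.FindAllStringSubmatch(html, -1) {
		body := strings.TrimSpace(m[1])
		if json.Valid([]byte(body)) {
			docs = append(docs, body)
		}
	}
	return docs
}

func main() {
	sitemap := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset/1</loc></url>
  <url><loc>https://example.org/dataset/2</loc></url>
</urlset>`)
	urls, err := parseSitemap(sitemap)
	if err != nil {
		panic(err)
	}

	// A fixed page body stands in for an HTTP GET of each URL.
	page := `<html><head><script type="application/ld+json">{"@type":"Dataset","name":"demo"}</script></head></html>`

	// store stands in for an S3-compatible bucket: object key -> JSON-LD.
	store := map[string]string{}
	for _, u := range urls {
		for i, doc := range extractJSONLD(page) {
			store[fmt.Sprintf("%s#%d", u, i)] = doc
		}
	}
	fmt.Println(len(urls), "pages,", len(store), "documents stored")
}
```

Blocks that are not valid JSON are skipped, so later stages only see documents they can parse.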

The Miller takes the JSON-LD documents pulled and stored by the Summoner and runs them through various millers. Each miller performs a different task, such as indexing, conversion, or validation.
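One way to picture this stage is a loop that hands every stored document to every enabled miller. The sketch below is an assumption for illustration only: the `Miller` interface, `runMillers`, and `textMiller` are hypothetical names, the in-memory map stands in for the object store, and the toy inverted index stands in for a real text index.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Miller is a hypothetical interface: each miller consumes one stored
// JSON-LD document at a time.
type Miller interface {
	Name() string
	Mill(key string, doc []byte) error
}

// textMiller builds a toy inverted index over the string values of a
// document (a stand-in for a real text index).
type textMiller struct {
	index map[string][]string // token -> object keys
}

func (t *textMiller) Name() string { return "text" }

func (t *textMiller) Mill(key string, doc []byte) error {
	var v interface{}
	if err := json.Unmarshal(doc, &v); err != nil {
		return fmt.Errorf("%s: %w", key, err)
	}
	seen := map[string]bool{} // record each token once per document
	collect(v, func(s string) {
		for _, tok := range strings.Fields(strings.ToLower(s)) {
			if !seen[tok] {
				seen[tok] = true
				t.index[tok] = append(t.index[tok], key)
			}
		}
	})
	return nil
}

// collect walks decoded JSON and calls fn for each string value.
func collect(v interface{}, fn func(string)) {
	switch x := v.(type) {
	case string:
		fn(x)
	case []interface{}:
		for _, e := range x {
			collect(e, fn)
		}
	case map[string]interface{}:
		for _, e := range x {
			collect(e, fn)
		}
	}
}

// runMillers passes every stored document through every miller and
// gathers the errors instead of stopping at the first one.
func runMillers(store map[string][]byte, millers []Miller) []error {
	keys := make([]string, 0, len(store))
	for k := range store {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic processing order

	var errs []error
	for _, m := range millers {
		for _, k := range keys {
			if err := m.Mill(k, store[k]); err != nil {
				errs = append(errs, fmt.Errorf("%s miller: %w", m.Name(), err))
			}
		}
	}
	return errs
}

func main() {
	store := map[string][]byte{
		"a.jsonld": []byte(`{"@type":"Dataset","name":"Ocean temperature"}`),
		"b.jsonld": []byte(`{"@type":"Dataset","name":"Ocean salinity"}`),
	}
	text := &textMiller{index: map[string][]string{}}
	errs := runMillers(store, []Miller{text})
	fmt.Println(text.index["ocean"], len(errs)) // prints [a.jsonld b.jsonld] 0
}
```

Keeping each miller behind a small interface means a run can enable or skip millers independently, and one malformed document does not stop the rest of the run.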

The current millers are:

  • text: build a text index in raw bleve
  • spatial: parse and build a spatial index using a geohash server
  • graph: convert the JSON-LD to RDF for SPARQL queries
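The spatial miller relies on a geohash server, so Gleaner itself does not need to compute geohashes. For readers unfamiliar with the idea, the sketch below shows how a latitude/longitude pair is commonly encoded as a geohash string. It is a generic illustration of the standard encoding, not code from Gleaner, and the function name `geohash` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// base32 is the geohash alphabet (digits plus letters, without a, i, l, o).
const base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

// geohash encodes a coordinate by repeatedly halving the longitude and
// latitude ranges, alternating between the two, and emitting one
// base-32 character for every five bits.
func geohash(lat, lon float64, precision int) string {
	latRange := [2]float64{-90, 90}
	lonRange := [2]float64{-180, 180}

	var sb strings.Builder
	bits, ch := 0, 0
	useLon := true // bits alternate, starting with longitude

	for sb.Len() < precision {
		r, v := &latRange, lat
		if useLon {
			r, v = &lonRange, lon
		}
		mid := (r[0] + r[1]) / 2
		if v >= mid {
			ch = ch<<1 | 1 // upper half: emit 1
			r[0] = mid
		} else {
			ch = ch << 1 // lower half: emit 0
			r[1] = mid
		}
		useLon = !useLon

		bits++
		if bits == 5 {
			sb.WriteByte(base32[ch])
			bits, ch = 0, 0
		}
	}
	return sb.String()
}

func main() {
	fmt.Println(geohash(57.64911, 10.40744, 5)) // prints u4pru
}
```

Nearby points usually share a prefix, which is what lets a geohash server answer spatial queries with string lookups.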

A set of other, more experimental millers also exists:

  • tika: access the actual data files referenced by the JSON-LD and process them through Apache Tika. The extracted text is then indexed in the text system, allowing full-text search on the document contents.
  • blast: like text, but using the blast package built on bleve
  • fdptika: like tika, but using Frictionless Data Packages
  • ftpgraph: like graph, but pulling JSON-LD files from Frictionless Data Packages
  • prov: build a basic prov graph from the overall gleaner process
  • shacl: validate the facility resources against defined SHACL shape graphs

How to run (or at least try..., this is still a work in progress)

A key focus of current development is to make it easy for groups to run Gleaner locally as a means to test and validate their structured data publishing workflow.

Running

Some early documentation on running gleaner can be found at: Running Gleaner.

Validation (SHACL Shapes)

Work on the validation of data graphs using W3C SHACL shape graphs is taking place in the GeoShapes repository. Gleaner leverages the pySHACL Python package to perform the actual validation.

Profiling (for dev work)

You can profile runs with

go tool pprof --pdf gleaner /tmp/profile317320184/cpu.pprof  > profileRun1.pdf

Example CPU and Memory profile of a recent release.
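The command above reads a `cpu.pprof` file from a temporary directory. How Gleaner itself switches profiling on is not described here, so the following is only a sketch of one way a Go program can produce such a file, using the standard library's `runtime/pprof` package. The helper name `profileCPU` and the busy-loop workload are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
)

// profileCPU runs work while recording a CPU profile and returns the
// path of the profile file it wrote.
func profileCPU(work func()) (string, error) {
	// Create a fresh temporary directory, e.g. /tmp/profile123456789.
	dir, err := os.MkdirTemp("", "profile")
	if err != nil {
		return "", err
	}
	path := filepath.Join(dir, "cpu.pprof")

	f, err := os.Create(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		return "", err
	}
	work()
	pprof.StopCPUProfile() // flushes the profile to f before returning

	return path, nil
}

func main() {
	path, err := profileCPU(func() {
		// Placeholder workload so the profiler has something to sample.
		sum := 0
		for i := 0; i < 50_000_000; i++ {
			sum += i % 7
		}
		_ = sum
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("CPU profile written to", path)
}
```

The printed path can then be passed to `go tool pprof` together with the compiled binary, as in the command above.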
