Visit https://gleaner.io
Gleaner is a structured data web indexing tool developed as part of NSF EarthCube. It focuses on collecting JSON-LD encoded data graphs that describe data resources and services. Gleaner can then process these documents and generate a semantic network based on a list of providers.
A set of cloud based tools and functions can be found at https://fence.gleaner.io/. These can be used online via the browser or through command line calls. They are also available for use in Jupyter notebooks to develop workflows.
The Summoner uses sitemap files to access and parse facility resource pages. Summoner places the results of these calls into S3 API compliant storage.
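The harvest pattern the Summoner follows can be sketched in a few lines: read a sitemap, collect the page URLs it lists, then (in a real run) fetch each page's JSON-LD and write it to S3 API compliant storage. This is a minimal illustration, not Gleaner's actual implementation; the two-entry sitemap is a hypothetical stand-in inlined so the sketch is self-contained, whereas Gleaner fetches real provider sitemaps over HTTP.

```python
# Simplified sketch of the sitemap-reading step: extract the <loc> URLs
# from a sitemap document. The XML below is a hypothetical example.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset/1</loc></url>
  <url><loc>https://example.org/dataset/2</loc></url>
</urlset>"""

def sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> values listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(SITEMAP_NS + "loc")]

# Each URL would then be fetched and its JSON-LD stored by the Summoner.
urls = sitemap_urls(sitemap_xml)
print(urls)
```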
The Miller takes the JSON-LD documents pulled and stored by the Summoner and runs them through various millers, each of which processes the documents toward a different index or output.
The current millers are:
- text: build a text index in raw bleve
- spatial: parse and build a spatial index using a geohash server
- graph: convert the JSON-LD to RDF for SPARQL queries
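To make the graph miller's job concrete, here is a much-simplified sketch of turning a JSON-LD document into N-Triples suitable for loading into a SPARQL store. A real miller uses a full JSON-LD processor to expand `@context` terms; this sketch assumes a flat document that already uses absolute IRIs, and the example dataset is hypothetical.

```python
# Simplified JSON-LD -> N-Triples conversion for a flat, context-free
# document. Real JSON-LD processing (expansion, @context resolution,
# nested nodes) is considerably more involved.
import json

doc = json.loads("""{
  "@id": "https://example.org/dataset/1",
  "@type": "https://schema.org/Dataset",
  "https://schema.org/name": "Example ocean temperature dataset"
}""")

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def to_ntriples(node: dict) -> list[str]:
    """Emit one N-Triples line per property of a flat JSON-LD node."""
    subj = f"<{node['@id']}>"
    triples = []
    for key, value in node.items():
        if key == "@id":
            continue
        if key == "@type":
            triples.append(f"{subj} <{RDF_TYPE}> <{value}> .")
        else:
            triples.append(f'{subj} <{key}> "{value}" .')
    return triples

for line in to_ntriples(doc):
    print(line)
```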
A set of other, more experimental millers also exists:
- tika: access the actual data files referenced by the JSON-LD and process them through Apache Tika. The extracted text is then indexed in the text system, allowing full text search on the document contents.
- blast: like text, but using the blast package built on bleve
- fdptika: like tika, but using Frictionless Data Packages
- ftpgraph: like graph, but pulling JSON-LD files from Frictionless Data Packages
- prov: build a basic prov graph from the overall gleaner process
- shacl: validate the facility resources against defined SHACL shape graphs
A key focus of current development is making it easy for groups to run Gleaner locally as a means to test and validate their structured data publishing workflow.
Some early documentation on running Gleaner can be found at: Running Gleaner.
Work on the validation of data graphs using W3C SHACL shape graphs is taking place in the GeoShapes repository. Gleaner leverages the pySHACL Python package to perform the actual validation.
You can profile runs with:

```
go tool pprof --pdf gleaner /tmp/profile317320184/cpu.pprof > profileRun1.pdf
```
Example CPU and Memory profile of a recent release.