
fuji's People

Contributors

afuetterer, az-ihsan, broeder-j, dependabot[bot], dfsp-spirit, dmusulas, huberrob, ignpelloz, jantau, karacolada, kitchenprinzessin3880, marc-portier, marioa, pevans-gfz, sneumann, uschindler, vemonet, wilkos-dans


fuji's Issues

FsF-F3-01M checks for wrong identifier

The test currently checks for the presence of a PID (aka the 'dataset identifier'), but the 'data identifier' is required instead.

Therefore the schema.org check:

'{data_identifier: ("@id" || url."@id" || identifier.value) || url, data_access_url: (distribution.contentUrl || distribution[*].contentUrl)}'

has to be changed to:

'{data_identifier: identifier.value, data_access_url: (distribution.contentUrl || distribution[*].contentUrl)}'
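A minimal sketch of how such a JMESPath mapping evaluates against schema.org JSON-LD, using the current expression; the sample record is invented for illustration.

import jmespath

record = {
    '@id': 'https://doi.org/10.1234/example',
    'identifier': {'value': 'dataset-123'},
    'distribution': {'contentUrl': 'https://example.org/data.csv'},
}

query = ('{data_identifier: ("@id" || url."@id" || identifier.value) || url, '
         'data_access_url: (distribution.contentUrl || distribution[*].contentUrl)}')

print(jmespath.search(query, record))
# {'data_identifier': 'https://doi.org/10.1234/example',
#  'data_access_url': 'https://example.org/data.csv'}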

FsF-F2-01M not recognising dcterms as string

For metric 3, the tool reports that no metadata was found for the following dataset:

http://dda.dk/catalogue/150

However, it includes a number of dcterms in the form of string elements:
dcterms:title
dcterms:description
dcterms:publisher
dcterms:identifier
... and more.

The methods of metadata interpretation should be as broad as possible, to maximise the evaluation results; see the sketch below.
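A minimal sketch of such broader interpretation, assuming the dcterms appear as <meta name="dcterms:..."> elements in the landing page HTML; the snippet is illustrative, not the actual markup of http://dda.dk/catalogue/150.

import lxml.html

html = ('<html><head>'
        '<meta name="dcterms:title" content="Example study" />'
        '<meta name="dcterms:publisher" content="Example archive" />'
        '</head><body></body></html>')

dom = lxml.html.fromstring(html)
dcterms = {}
for meta in dom.xpath('//meta[@name and @content]'):
    name = meta.attrib['name'].lower()
    # accept both 'dcterms:' / 'dcterms.' and plain 'dc.' style prefixes
    if name.startswith(('dcterms:', 'dcterms.', 'dc.')):
        key = name.replace(':', '.').split('.', 1)[1]
        dcterms[key] = meta.attrib['content']

print(dcterms)  # {'title': 'Example study', 'publisher': 'Example archive'}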

Testing restricted data

There should be a mechanism to test the metadata and data of restricted datasets (HTTP status code 401).
Currently, only accessible data_content_identifiers are considered as part of the assessment. If an identifier is not accessible it is excluded, and this influences other tests performed (e.g., metadata includes identifier, data content properties, file format, etc.); a sketch of an alternative follows below.
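A minimal sketch of how a restricted identifier could be kept in the assessment instead of being dropped; the function name and 'access' labels are illustrative, not the tool's actual API.

import requests

def classify_content_identifier(url):
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return {'url': url, 'access': 'unreachable'}
    if response.status_code == 401:
        # restricted, but still a valid data content identifier: keep it so
        # later tests (file format, data content properties) can still run
        return {'url': url, 'access': 'restricted'}
    if response.ok:
        return {'url': url, 'access': 'open'}
    return {'url': url, 'access': 'unknown'}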

Provenance Properties

The mapping should be extended to cover the selected fields as follows (ref: https://www.w3.org/TR/prov-dc/):

Descriptive metadata (What): abstract, accrualMethod, accrualPeriodicity, accrualPolicy, alternative, audience, bibliographicCitation, conformsTo, coverage, description, educationLevel, extent, format, hasPart, isPartOf, identifier, instructionalMethod, isRequiredBy, language, mediator, medium, relation, requires, spatial, subject, tableOfContents, temporal, title, type
Provenance (Who): contributor, creator, publisher, rightsHolder
Provenance (When): available, created, date, dateAccepted, dateCopyrighted, dateSubmitted, issued, modified, valid
Provenance (How): accessRights, hasFormat, hasVersion, isFormatOf, isVersionOf, license, isReferencedBy, isReplacedBy, references, replaces, rights, source

'provenance' --> This leaves one very special term: provenance. This term is defined as a "statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity, and interpretation" [DCTERMS], a definition that corresponds to the notion of provenance for artworks. This term can be considered a link between the resource and any provenance statement about the resource, so it is not included in any of the aforementioned categories.
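The grouping above, written out as a simple lookup table; a sketch of how the mapping could be stored, not the tool's actual data structure.

PROV_DC_CATEGORIES = {
    'what': ['abstract', 'accrualMethod', 'accrualPeriodicity', 'accrualPolicy',
             'alternative', 'audience', 'bibliographicCitation', 'conformsTo',
             'coverage', 'description', 'educationLevel', 'extent', 'format',
             'hasPart', 'isPartOf', 'identifier', 'instructionalMethod',
             'isRequiredBy', 'language', 'mediator', 'medium', 'relation',
             'requires', 'spatial', 'subject', 'tableOfContents', 'temporal',
             'title', 'type'],
    'who': ['contributor', 'creator', 'publisher', 'rightsHolder'],
    'when': ['available', 'created', 'date', 'dateAccepted', 'dateCopyrighted',
             'dateSubmitted', 'issued', 'modified', 'valid'],
    'how': ['accessRights', 'hasFormat', 'hasVersion', 'isFormatOf',
            'isVersionOf', 'license', 'isReferencedBy', 'isReplacedBy',
            'references', 'replaces', 'rights', 'source'],
}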

Change the way we model links

Currently we just use the URL to store links. However, if we want to check content, we also need the MIME type, size, etc.
This information is available in schema.org, DataCite, and DCAT metadata. In issue #10 I proposed changing the method get_html_typed_links so that it returns a dictionary instead of a URL string only.
The dictionary structure is: {'href': URL, 'format': FORMAT, 'rel': REL, 'type': TYPE}

See: https://signposting.org/conventions/#attributes

We should therefore also change the JSON queries accordingly.

Structured data embedded in the data page

Structured data might be included in the HTML page through external JavaScript, so there should be a mechanism to crawl dynamic web pages. It looks like there are a few frameworks (such as Selenium and scrapy-splash) that can be tied in with Python to do this; see the sketch below.
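A minimal sketch of fetching a JavaScript-rendered page with headless Selenium; it assumes Chrome and a matching chromedriver are installed.

from selenium import webdriver

def get_rendered_html(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # JSON-LD injected by scripts is now part of the DOM
        return driver.page_source
    finally:
        driver.quit()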

Improve the flow of the community-standards check

There are several ways of retrieving the standards supported by a repository (namespaces, re3data API, standards, provided endpoint). The workflow should be refined by identifying primary and secondary sources.

Bugs Found

Schema.org Types Supported

@ajaunsen

Evaluating e.g.
https://www.proteinatlas.org/ENSG00000110651-CD81/cell

Metric 5 claims no JSON-LD is found.

Metric 3 reports:
"INFO: Found JSON-LD schema.org but record is not of type "Dataset\””,

Investigating the content, I find there is indeed an ld+json object, but of type="DataRecord". Not sure whether this is an older schema.org type that has since been deprecated, but the data provider has clearly defined structured data according to some schema.org form.
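A minimal sketch of extracting the embedded JSON-LD and checking its @type; the accepted-types set illustrates how "DataRecord" could be allowed alongside "Dataset".

import json
import lxml.html

ACCEPTED_TYPES = {'Dataset', 'DataRecord'}

def schema_org_type_ok(landing_html):
    dom = lxml.html.fromstring(landing_html)
    for script in dom.xpath('//script[@type="application/ld+json"]'):
        try:
            data = json.loads(script.text_content())
        except ValueError:
            continue  # skip malformed JSON-LD blocks
        if isinstance(data, dict) and data.get('@type') in ACCEPTED_TYPES:
            return True
    return False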

Metrics YAML File

  1. A new version should be created that includes the objects' metrics descriptions from v3.0.
  2. Add assessment aspects for each metric.
  3. Update the server.ini file.

Improvement based on Signposting the Scholarly Web (http://www.signposting.org/)

  1. So far we only test typed links in the HEAD section of the landing page. We should also test for links included in the HTTP response header following the signposting.org convention (e.g., GET/HEAD requests); see the sketch after this list.
  2. There will be a proposal from @hvdsomp with the final recommendations on implementing link relation types (including linkset) on the data landing page. To test the recommendations, we need (a) pilot repositories (e.g., PANGAEA) implementing them, and the fuji assessment should be extended to reflect the final recommendations.
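A minimal sketch of reading signposting links from the HTTP response headers with requests, which parses the Link header into response.links, keyed by the rel attribute; the function name is illustrative only.

import requests

def get_signposting_links(url, rel='item'):
    response = requests.head(url, allow_redirects=True, timeout=10)
    # response.links looks like {'item': {'url': '...', 'rel': 'item'}, ...}
    return response.links.get(rel)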

Store namespace URIs and/or prefixes

It seems as if namespaces used in e.g. RDF metadata can be very valuable for several metrics, because we can use them for:

  1. assessing if community-specific vocabularies are used
  2. assessing if community-specific metadata is used
  3. assessing if provenance metadata is included

We should introduce a variable (list) which stores these in the FAIRCheck class.

Further, we should feed this variable into the various methods while checking the metadata, e.g. when we retrieve Dublin Core, JSON-LD (schema.org), DataCite, DCAT, etc. metadata; a sketch follows below.
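A minimal sketch of harvesting namespace URIs from parsed RDF metadata with rdflib; the function name and the plain-dict return value are illustrative.

import rdflib

def collect_namespaces(rdf_data, fmt='xml'):
    graph = rdflib.Graph()
    graph.parse(data=rdf_data, format=fmt)
    # graph.namespaces() yields the (prefix, URI) pairs bound in the document
    return {str(prefix): str(uri) for prefix, uri in graph.namespaces()}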

Metadata validation

Metadata (schema.org) should be validated against its schema before the document is considered part of the assessment.
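A minimal sketch of such a pre-assessment check, using the jsonschema package against a deliberately small, hypothetical schema; real validation would need the full schema.org definitions.

import jsonschema

MINIMAL_DATASET_SCHEMA = {
    'type': 'object',
    'required': ['@context', '@type', 'name'],
    'properties': {'@type': {'const': 'Dataset'}},
}

def is_valid_schema_org(document):
    try:
        jsonschema.validate(instance=document, schema=MINIMAL_DATASET_SCHEMA)
        return True
    except jsonschema.ValidationError:
        return False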

Data file format checks

In the current implementation, only the file formats of active content URLs are considered. There can be a case where the format is specified in the metadata but the data content URL is missing; there should be a debug message indicating this.
Example dataset: https://hdl.handle.net/10411/G8MPEI

Re3data metadata standards entries do not contain correct schema or namespace URL

Unfortunately, re3data does not return e.g. the schema location URL, as we would expect for XML schemas like ABCD or DC; instead, it links to a dcc.ac.uk catalogue entry.

This is a problem, since the unique identifier for an XML schema surely is its schema location or namespace.

Example output from re3data:

<r3d:metadataStandard>
  <r3d:metadataStandardName metadataStandardScheme="DCC">ABCD - Access to Biological Collection Data</r3d:metadataStandardName>
  <r3d:metadataStandardURL>http://www.dcc.ac.uk/resources/metadata-standards/abcd-access-biological-collection-data</r3d:metadataStandardURL>
</r3d:metadataStandard>

get_html_typed_links

@huberrob
{'url': href, 'type': l.attrib.get('type'), 'rel': l.attrib.get('rel'), 'profile': l.attrib.get('format')}
I cannot find a 'format' attribute. Is this a standard attribute of typed links?

Change method get_html_typed_links from regex testing to DOM+xpath

The following code does this and would also solve issue #7:

import lxml.html

def get_html_typed_links(self, rel="item"):
    # Use typed links in the HTML head to help machines find the resources
    # that make up a publication, and to find domain-specific metadata.
    datalinks = []
    dom = lxml.html.fromstring(self.landing_html.encode('utf8'))
    links = dom.xpath('/*/head/link[@rel="' + rel + '"]')
    for l in links:
        href = l.attrib.get('href')
        # handle relative paths
        if href and href.startswith('/'):
            href = self.landing_origin + href
        datalinks.append({'href': href,
                          'format': l.attrib.get('format'),
                          'rel': l.attrib.get('rel'),
                          'type': l.attrib.get('type')})
    return datalinks

FsF-F2-01M: change sequence of tests

Currently we always perform all possible tests (OG, schema.org, DC, DataCite, etc.) to check for metadata, in a 'from less to more important' sequence. But if we already have the complete set of required metadata from e.g. schema.org, there is no need to call the DataCite API as well.

I would therefore propose the following sequence:

First check for embedded schema.org metadata. If the metadata is not complete, check DataCite. If it is still not complete, try content negotiation to get some RDF/XML, e.g. DCAT records.

Then get embedded DC, OG, etc. as a fallback option; a sketch of this early-exit cascade follows below.
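A minimal sketch of the proposed sequence, assuming hypothetical harvester functions and a required-field set; this is not the tool's actual API.

REQUIRED = {'title', 'description', 'identifier', 'license'}

def harvest_metadata(harvesters):
    # harvesters are ordered from most to least important source,
    # e.g. [get_schema_org, get_datacite, get_dcat, get_dublin_core]
    metadata = {}
    for harvest in harvesters:
        metadata.update({k: v for k, v in harvest().items() if v})
        if REQUIRED.issubset(metadata):
            break  # already complete: skip the remaining sources
    return metadata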

Evaluations fail with KeyError: 'content-type'
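A likely cause, sketched with requests under the assumption that response headers are read with dictionary-style access: a response without a Content-Type header then raises KeyError, while .get() degrades gracefully.

import requests

response = requests.head('https://example.org/data.csv', allow_redirects=True)
# content_type = response.headers['content-type']  # KeyError if the header is absent
content_type = response.headers.get('content-type', '')  # safe default instead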

Open or restricted access to CC-licensed data

Found here:

https://creativecommons.org/faq/#can-i-share-cc-licensed-material-on-password-protected-sites

Can I share CC-licensed material on password-protected sites?
Yes. This is not considered to be a prohibited measure, so long as the protection is merely limiting who may access the content, and does not restrict the authorized recipients from exercising the licensed rights. For example, you may post material under any CC license on a site restricted to members of a certain school, or to paying customers, but you may not place effective technological measures (including DRM) on the files that prevents them from sharing the material elsewhere.

F-UJI Test Service

Host an instance of the F-UJI service over HTTPS (through the PANGAEA production server).
@uschindler your inputs are needed here in the near future ;)

Metadata External Aggregators

Expand the current metadata search from external resources by including B2FIND search and content negotiation (RDF-related accept types); see the sketch below.
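A minimal sketch of such content negotiation; the accept types listed are common RDF serialisations, not an exhaustive set.

import requests

RDF_ACCEPT = 'application/rdf+xml, text/turtle, application/ld+json'

def negotiate_rdf(url):
    # ask the landing page URL for an RDF representation instead of HTML
    response = requests.get(url, headers={'Accept': RDF_ACCEPT}, timeout=10)
    content_type = response.headers.get('content-type', '')
    if response.status_code == 200 and 'html' not in content_type:
        return response.text
    return None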

Application does not start.

Hi there,

For EOSC-Synergy we'd like to evaluate this tool. However, we cannot get it started, neither from the command line nor from Docker (both show the same issue).
After fixing dependencies (Levenshtein and the rdflib version), the app exits with the following:

python3 -m fuji_server
2020-07-23 15:02:20,861 - root - INFO - Total metrics defined: 14
2020-07-23 15:02:20,902 - root - INFO - Total SPDX licenses : 422
2020-07-23 15:02:20,902 - root - INFO - Total re3repositories found from datacite api : 199
2020-07-23 15:02:20,902 - root - INFO - Total subjects area of imported metadata standards : 97
2020-07-23 15:02:20,902 - root - INFO - Total LD vocabs imported : 1021
2020-07-23 15:02:20,902 - root - INFO - Total default namespaces specified : 20
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/connexion/utils.py", line 120, in get_function_from_name
    function = deep_getattr(module, attr_path)
  File "/usr/local/lib/python3.7/site-packages/connexion/utils.py", line 68, in deep_getattr
    return functools.reduce(getattr, attrs, obj)
AttributeError: module 'fuji_server.controllers' has no attribute 'authorization_controller'

If I remove the last line from the YAML:

x-basicInfoFunc: fuji_server.controllers.authorization_controller.check_UserLogin

It will start, but only 404 responses are returned.

2020-07-23 14:44:49,253 - root - INFO - Total metrics defined: 14
2020-07-23 14:44:49,269 - root - INFO - Total SPDX licenses : 422
2020-07-23 14:44:49,269 - root - INFO - Total re3repositories found from datacite api : 199
2020-07-23 14:44:49,269 - root - INFO - Total subjects area of imported metadata standards : 97
2020-07-23 14:44:49,269 - root - INFO - Total LD vocabs imported : 1021
2020-07-23 14:44:49,269 - root - INFO - Total default namespaces specified : 20
 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
2020-07-23 14:44:50,370 - werkzeug - INFO -  * Running on http://0.0.0.0:1071/ (Press CTRL+C to quit)
2020-07-23 14:45:03,840 - werkzeug - INFO - 127.0.0.1 - - [23/Jul/2020 14:45:03] "GET / HTTP/1.1" 404 -
2020-07-23 14:46:04,943 - werkzeug - INFO - 127.0.0.1 - - [23/Jul/2020 14:46:04] "GET /uji/api/v1/swagger.json HTTP/1.1" 404 -
2020-07-23 14:48:18,504 - werkzeug - INFO - 127.0.0.1 - - [23/Jul/2020 14:48:18] "GET / HTTP/1.1" 404 -
2020-07-23 14:53:23,659 - werkzeug - INFO - 127.0.0.1 - - [23/Jul/2020 14:53:23] "GET /evaluate HTTP/1.1" 404 -

Any suggestions?
