pangaea-data-publisher / fuji Goto Github PK

View Code? Open in Web Editor NEW

50.0 50.0 35.0 7.64 MB

FAIRsFAIR Research Data Object Assessment Service

License: MIT License

Python 95.45% HTML 0.01% Dockerfile 0.08% PHP 4.45%

fairdata hacktoberfest

fuji's People

Contributors

Stargazers

Watchers

fuji's Issues

FsF-F3-01M checks for wrong identifier

Test now checks for presence of PID (aka the 'dataset identifier') but 'data identifier' is required instead.

Therefore the schema.org check:

'data_identifier: ("@id" || url."@id" || identifier.value ) || url, data_access_url: (distribution.contentUrl || distribution[*].contentUrl)}'

has to be changed to:

'data_identifier: (data_access_url: (distribution.contentUrl || distribution[*].contentUrl)}'

Extend the tests related to license information

@ajaunsen
it would make sense to add additional score if the license can be resolved.

FsF-F2-01M not recognising dcterms as string

In metric 3 the tool reports no found metadata for the following dataset:

http://dda.dk/catalogue/150

however, it includes a number of dcterms in the form of string elements:
dcterms:title
dcterms:description
dcterms:publisher
dcterms:identifier
... and more.

The methods of metadata interpretation should be as broad as possible so as to maximise the evaluation results.

FsF-R1.1-01M default score seems to be 1 instead of 0

Testing restricted data

There should be a mechanism to test metadata and data of restricted data (Http status code 401).
Currently, only accessible data_content_identifiers are considered as part of the assessment. If the identifier is not accessible it will be not excluded and this has influence on other tests performed (e.g., metadata includes identifier, data content properties, file format, etc.)

exclude 'file_type' from the response.

i excluded 'file_type' from the response.
@huberrob {'url': None, 'type': 'text/tab-separated-values', 'size': '902 data points', 'profile': 'http://datacite.org/schema/kernel-4'}
format may be specified in the metadata. but the data content uri is missing. will this pass the test?

Originally posted by @kitchenprinzessin3880 in #36 (comment)

Opening very large compressed (tar.gz) files slows the test extremely

We should test file size before download and exclude very large archives
See for example doi:10.17882/42182
which provides a > 30 Gb file

Provenance Properties

Mapping should be extended to cover the selected fields as follows (ref: https://www.w3.org/TR/prov-dc/)

Descriptive metadata	What	abstract, accrualMethod, accrualPeriodicity, accrualPolicy, alternative, audience, bibliographicCitation, conformsTo, coverage, description, educationLevel, extent, format, hasPart, isPartOf, identifier, instructionalMethod, isRequiredBy, language, mediator, medium, relation, requires, spatial, subject, tableOfContents, temporal, title, type
Provenance	Who	contributor, creator, publisher, rightsHolder
Provenance	When	available, created, date, dateAccepted, dateCopyrighted, dateSubmitted, issued, modified, valid
Provenance	How	accessRights, hasFormat, hasVersion, isFormatOf, isVersionOf, license, isReferencedBy, isReplacedBy, references, replaces, rights, source

'provenance' --> This leaves one very special term: provenance. This term is defined as a "statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity, and interpretation" [DCTERMS], a definition that corresponds to the notion of provenance for artworks. This term can be considered a link between the resource and any provenance statement about the resource, so it is not included in any of the aforementioned categories.

Change way we model links

Currently we just use the URL to store links. However if we want to check content, we also need mime type, size etc.
This information is available inschema.org, daacite as well as DCAT metadata. in issue #10 I proposed to change method get_html_typed_links. The new method would return a dictionary instead of a URL string only.
Dictionary structure is: {'href':URL,'format':FORMAT, 'rel':REL, 'type':TYPE}

See: https://signposting.org/conventions/#attributes

We should therefore also change JSON queries accordingly..

Metadata element 'Language' should be checked

Structured data embedded in the data page

Strctured might be include in the html page through external javascript.there should be a mechanism to crawl dynamic web page. It looks like there are a few different frameworks (like selenium and scrapy-splash) that you can tie in with Python to do this.

What to do with relative URL paths in e.g. signposting?

Sometimes we have relative paths indicating e.g. in tags paths to object_identifier_content?
Will we accept these as valid identifiers? Or only if they resolve?

Test https://data.neonscience.org/data-products/DP1.00001.001

Improve the flow of the community-standards check

There are several ways of retrieving standards supported by the repository (namespaces, re3data api,standards, provided endpoint). the workflow should be refined by identifying primary and secondary sources.

Implement FsF-A2-01M - metadata preservation of deleted records

We can just start this by using usage of DOIs or other registry-bound PIDs as an indicator for truly preserved records.

Btw. here is an example of a deleted PANGAEA record:
https://doi.pangaea.de/10.1594/PANGAEA.50080

Bugs Found

Commented by @ajaunsen Here are some datasets that caused the evaluator to crash

https://su.figshare.com/articles/Data_for_Does_historical_land_use_affect_the_regional_distribution_of_fleshy-fruited_woody_plants_Arnell_et_al_2019_/10318046 | KeyError: 'object_size'
https://datasets.aida.medtech4health.se/10.23698/aida/brln | AttributeError: 'dict' object has no attribute 'decode'
https://services.fsd.uta.fi/catalogue/FSD3263 | KeyError: 'object_size'
https://usn.figshare.com/articles/Data_Journal_EvolutionaryApplications_RData/7770785 | KeyError: 'object_size'

OpenDAP support

We should add OpenDAP support. This is a very common protocol which is very common in earth and envronmental sciences. .. and easy to implement since it supports content-negotiation :

We just need to send a header content-type: application/vnd.opendap

Schema.org Types Suported

@ajaunsen

Evaluating e.g.
https://www.proteinatlas.org/ENSG00000110651-CD81/cell

Metric 5 claims no JSON-LD is found.

Metric 3 reports:
"INFO: Found JSON-LD schema.org but record is not of type "Dataset\””,

Investigating the content I find there is indeed a ld+json object, but of type=“DataRecord”. Not sure if this is a previous type of schema.org that has become deprecated or what… but the data provider clearly has defined structured data according to some schema.org form.

Handles receive zero scoring in FsF-F1-01D

But should...
example: https://hdl.handle.net/11676/u6vs4ubCc7R6yi4UZjEoSB7l

Metadata Portal

Portals to check metadata of research data:
https://beta.explore.openaire.eu/
http://www.eurogeoss-broker.eu/

further testing required though

Metrics YAML File

New version should be created and include the objects metrics descriptions from v3.0.
Add assessment aspects for each metric.
Update server.ini file

Community vs Multidicipline

https://rdamsc.bath.ac.uk/subject/Multidisciplinary

Are multidisciplinary metadata standards accepted as community-endorsed standards in the context of FAIR?

Improvement based on Signposting the Scholarly Web (http://www.signposting.org/)

By now we only test typed links in the HEAD section of the landing page. We should also test for links included in the header response following the signposting.org convention (e.g., get/head request)
There will be a proposal from @hvdsomp on the final recommendations on implementing link relation types (including linkset) in the data landing page. To test the recommendations, we need (a) pilot repositories (e.g., pangaea) implementing them and fuji assessment should be extended to reflect the final recommendations.

Store namespace URIs and/or prefixes

it seems as if namespaces used in e.g. RDF metadata can be very valuable for several metrics, because we can use them for:

assessing if community specific vocabularies are used
assessing if community specific metadata is used
assessing if provenance metadata is included
etc..
We should introduce a variable (list) which stores this in the FAIRCheck class?

Further we should feed this variable in various methods while checking the metadata.. e.g. when we retrieve dublin core, json-ld (schema.org), datacite and dcat etc.. metadata

re3data (repository records are not detected)

Comments from @ajaunsen

some of the tests seem to rely on extracting information from re3data.org, but this is either not working or not robust enough
examples:

snd.gu.se
This is the dataset I tested:
https://snd.gu.se/en/catalogue/study/snd1038
metric 13 also fails for e.g. DataDOI which is listed with OAI-PMH API on re3data.org:
https://datadoi.ut.ee/handle/33/152

Metadata validation

Metadata (schema.org) should be validated against its schema before the document is considered part of the assessment.

Consistent use of term in the debug message

there are various terms used in the debug message of the response, data vs dataset vs data object;
consistency is required here.

Data file format checks

In current implementation, only file formats of active content urls are considered. there can be a case where the format is specified in the metadata but the data content url is missing. there should be a debug message indicating it.
example of data: ttps://hdl.handle.net/10411/G8MPEI

Enable DCAT metadata check

Implement content negotiation to retrieve XML Metadata

I think we forgot to implement a simple Accept type = text/xml content negotiation procedure...
This should be there. A MetaDataCollectorXML already exists and can be reused

Re3data metadata standards entries do not contain correct schema or namespace URL

Unfortunately re3data does not give back e.g. schema location URL as we would expect this for XML schemas like ABCD or DC, it instead links to a dcc.ac.uk catalogue entry.

This is a problen since the unique identifier for a XML schema surely is its schema location or namespace.

Example out put from Re3data:

r3d:metadataStandard
<r3d:metadataStandardName metadataStandardScheme="DCC">ABCD - Access to Biological Collection Data</r3d:metadataStandardName>
r3d:metadataStandardURLhttp://www.dcc.ac.uk/resources/metadata-standards/abcd-access-biological-collection-data</r3d:metadataStandardURL>
</r3d:metadataStandard>

OpenAIRE Explore/Graph - Potential Registry to Support Automated FAIR Data Assessment.

Source: OpenAIRE EXPLORE
https://beta.explore.openaire.eu/search/dataset?datasetId=datacite____::e95c6f777dcef1498cd1cb16b4574746

API information: https://develop.openaire.eu/api.html#datasets

Example of dataset search: http://api.openaire.eu/search/datasets?doi=10.7925/drs1.duchas_4474990

get_html_typed_links

@huberrob
{'url': href, 'type': l.attrib.get('type'), 'rel': l.attrib.get('rel'), 'profile': l.attrib.get('format')}
Cannot find 'format' attribute. Is this standard attribute of rel?

Change method get_html_typed_links from regex testing to DOM+xpath

The following code does this and would also solve issue #7

def get_html_typed_links(self, rel="item"):
# Use Typed Links in HTTP Link headers to help machines find the resources that make up a publication.
# Use links to find domains specific metadata
datalinks = []

    dom = lxml.html.fromstring(self.landing_html.encode('utf8'))
    links=dom.xpath('/*/head/link[@rel="'+rel+'"]')
    for l in links:
        href=l.attrib.get('href')
        #handle relative paths
        if href.startswith('/'):
            href=self.landing_origin+href
        datalinks.append({'href':href,'format':l.attrib.get('format'), 'rel':l.attrib.get('rel'), 'type':l.attrib.get('type')})
    return datalinks

Metadata uses common semantic resources

This metric should be further extended by distinguishing 'data model' vocabularies (e.g., dcat) from 'content vocabularies' (e.g., pato, envo, P01).

FsF-F2-01M: change sequence of tests

Now we always perform all possible tests (og, schema.org. dc, datacite etc) to check for metadata in a 'from less to more imortant' sequence. But if we e.g. already have the complete set of required metadata from schema.org, there is no need anymore to call the datacite API or so.

I would therefore propose the following sequence:

First check for embedded schema.org metadata. If metadata is not complete check datacite. If is again not complete try some content negotiation to get some xml-rdf and get e.g. DCAT records.

Then get embedded DC, OG etc. as a fallback option

Change scoring for metadata FsF-F2-01M

sometimes metadata is found by the check and 'no metadata' is returned

Evaluations fail with KeyError: 'content-type'

Following examples of datasets from 3 different repositories fail with this error:

https://usn.figshare.com/articles/Data_Journal_EvolutionaryApplications_RData/7770785
https://data.dtu.dk/articles/Data_for_the_paper_A_dual_reporter_system_for_investigating_and_optimizing_translation_and_folding_in_E_coli_/10265420
https://su.figshare.com/articles/Data_for_Does_historical_land_use_affect_the_regional_distribution_of_fleshy-fruited_woody_plants_Arnell_et_al_2019_/10318046

File Format Test

what is the difference between "file_type": null and "mime_type" with reference to the generated response?
archive format should be noted as part of response (related list: https://en.wikipedia.org/wiki/List_of_archive_formats)
Multiple file formats may be specified.

Open or restricted access to CC licenced data

Found here:

https://creativecommons.org/faq/#can-i-share-cc-licensed-material-on-password-protected-sites

Can I share CC-licensed material on password-protected sites?
Yes. This is not considered to be a prohibited measure, so long as the protection is merely limiting who may access the content, and does not restrict the authorized recipients from exercising the licensed rights. For example, you may post material under any CC license on a site restricted to members of a certain school, or to paying customers, but you may not place effective technological measures (including DRM) on the files that prevents them from sharing the material elsewhere.

Comparison of Assessment Tools

@ajaunsen to provide the list of identifiers with large differences in terms of scores generated by the tool.

Check standardized communications protocol

Implement basic metric:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol

Should check if common web protocols are used eg. http, https, ftp etc..

based on URL schem part:

https://www.w3.org/2001/tag/doc/SchemeProtocols.html

Schema.org (publisher vs provider)

publisher -(https://schema.org/publisher) The publisher of the creative work.
provider (https://schema.org/provider): The service provider, service operator, or service performer; the goods producer. Another party (a seller) may offer those services or goods on behalf of the provider. A provider may also serve as the seller.

Are they same in the context of dataset?

F-UJI Test Service

Host an instance of F-UJI service over https (through pangaea production server)
@uschindler you inputs are needed here, in near future ;)

Metadata External Aggregators

Expand current metadata search from external resources by including b2find search and content-negotiation (rdf-related accept types).

OpenSearch support?

Shall we enable OpenSearch Support?

I found e.g. this: http://doi.org/10.25914/5eaa30de53244

which has a OpenSearch path indicated in header as:

When using the OpenSearch pattern and the data set title in a query:

https://geonetwork.nci.org.au/geonetwork/srv/eng/rdf.search?any=Magnetotelluric survey (Level1 concatenated, resampled, rotated NetCDF time series) of the AusLAMP Musgraves Province, 2016 to 2018&hitsPerPage=1

it (somehow strangely) returns a DCAT XML...

Hmmm..

python3 -m fuji_server
2020-07-23 15:02:20,861 - root - INFO - Total metrics defined: 14
2020-07-23 15:02:20,902 - root - INFO - Total SPDX licenses : 422
2020-07-23 15:02:20,902 - root - INFO - Total re3repositories found from datacite api : 199
2020-07-23 15:02:20,902 - root - INFO - Total subjects area of imported metadata standards : 97
2020-07-23 15:02:20,902 - root - INFO - Total LD vocabs imported : 1021
2020-07-23 15:02:20,902 - root - INFO - Total default namespaces specified : 20
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/connexion/utils.py", line 120, in get_function_from_name
    function = deep_getattr(module, attr_path)
  File "/usr/local/lib/python3.7/site-packages/connexion/utils.py", line 68, in deep_getattr
    return functools.reduce(getattr, attrs, obj)
AttributeError: module 'fuji_server.controllers' has no attribute 'authorization_controller'

If I remove last line from yaml:

fuji/fuji_server/yaml/swagger.yaml

Line 596 in 2913bbe

 x-basicInfoFunc: fuji_server.controllers.authorization_controller.check_UserLogin 

It will start, but only 404 responses are returned.

2020-07-23 14:44:49,253 - root - INFO - Total metrics defined: 14
2020-07-23 14:44:49,269 - root - INFO - Total SPDX licenses : 422
2020-07-23 14:44:49,269 - root - INFO - Total re3repositories found from datacite api : 199
2020-07-23 14:44:49,269 - root - INFO - Total subjects area of imported metadata standards : 97
2020-07-23 14:44:49,269 - root - INFO - Total LD vocabs imported : 1021
2020-07-23 14:44:49,269 - root - INFO - Total default namespaces specified : 20
 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
2020-07-23 14:44:50,370 - werkzeug - INFO -  * Running on http://0.0.0.0:1071/ (Press CTRL+C to quit)
2020-07-23 14:45:03,840 - werkzeug - INFO - 127.0.0.1 - - [23/Jul/2020 14:45:03] "GET / HTTP/1.1" 404 -
2020-07-23 14:46:04,943 - werkzeug - INFO - 127.0.0.1 - - [23/Jul/2020 14:46:04] "GET /uji/api/v1/swagger.json HTTP/1.1" 404 -
2020-07-23 14:48:18,504 - werkzeug - INFO - 127.0.0.1 - - [23/Jul/2020 14:48:18] "GET / HTTP/1.1" 404 -
2020-07-23 14:53:23,659 - werkzeug - INFO - 127.0.0.1 - - [23/Jul/2020 14:53:23] "GET /evaluate HTTP/1.1" 404 -

Any suggestions?

Support for eprints metadata

Not sure if we need this, but sometimes I see embedded eprints metadata:

e.g. in http://doi.org/10.5255/UKDA-SN-854270

pangaea-data-publisher / fuji Goto Github PK

fuji's People

Contributors

Stargazers

Watchers

Forkers

fuji's Issues

Recommend Projects

Recommend Topics

Recommend Org