
gin-doi's People

Contributors

achilleas-k, cgars, mpsonntag


gin-doi's Issues

[docker] Make 'assets' available to host

The "DOI registration" service and the static "DOI hosting" service both share the same "assets".

Still, when the assets of the project are updated, the assets need to be manually copied from this project to where the hosting of the static DOI pages is taking place.

Making the assets available to the host running this project via docker would remove this copy step and would ensure that the assets of the "DOI registration" and the static "DOI hosting" services are the same when the project is deployed.

Remove registration request fail on license content check

Currently a DOI registration request fails,

  • if a license can be identified and checked against common licenses in the licenses folder at gin.g-node.org/G-Node/Info
  • and if the content of the request LICENSE file differs in any way from the content of the G-Node reference license file identified above.

As a result, even a whitespace difference in the license text will fail a registration request.

Since the formatting of license texts can vary while still being valid licenses, this check should be removed.

Notify on submodules

Add a notification to the admins if a DOI request repository contains submodules.

datacite: set funderName as default

When the funding information contains no split character but has an entry, set funderName as default.

Currently it seems to default to awardNumber.

Automate registration confirmation email

We can semi-automate the confirmation email message that we send to the requesting user. Currently, I send it manually when the registration is finished and I base it on a template email that includes fields for the name of the user, the title of the dataset, and the DOI.

The service could create a link that encodes the above information (encrypted) in the same way the GOGS service encodes the request it sends to GIN DOI. The link can be included in the email that notifies us of a request. When the link is clicked, it loads a page with a text box pre-filled with the mail template with the fields filled in. We can then review or edit the email content and subject and press a button to have the DOI service send the email message to the user.

Requirements already met:

  • Email sending is already set up.
  • Encrypting variables into URL fields: The service can already decrypt these using the key it shares with GOGS/GIN-Web. We can use the same key and libgin functions to encrypt the email vars.

New requirements:

  • Form page for the email text that sends the email on submit.
  • Email notification template (that's sent to us) with link to form.

Cloning fails on server

I think this might be related to the server configuration or the network drive. We've had reports of this being an issue with the gin-cli. Local tests work, so I'll have to reproduce the server environment or safely test on the server somehow.

Automatic zip creation: unlock annex files

During the current automatic zip file creation, it can happen that some files are added as read-only. This seems to happen for annexed files that were uploaded using a previous version of the gin client and which remain locked and read only after annex uninit.

The read-only status of these files might catch some users off guard, so it would be nice to make sure that all files are added to the zip file as unlocked and read-write.

doi request fails - existing datacite.yml not found

Describe the bug

If I click on the button "request doi" in the repository https://gin.g-node.org/mikapfl/read_di_unfccc , I get the error message:

DOI request failed

The DOI file is missing or not valid. See the messages below for specific issues with the provided data.
Also, please see the DOI guide for detailed instructions.

No datacite.yml file found in repository

However, the datacite.yml definitely exists: https://gin.g-node.org/mikapfl/read_di_unfccc/src/main/datacite.yml

The problem might be related (wild guess) to the fact that this repository has no "master" branch and uses a "main" branch instead.

Filter empty RelatedIdentifier

When no DOI is given in a reference, an empty RelatedIdentifier entry is included. This creates an error at registration.
Could this be filtered?
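
A minimal sketch of such a filter, assuming a simplified `RelatedIdentifier` struct (the type and field names here are hypothetical stand-ins, not the service's actual metadata types). Entries whose identifier is empty or only whitespace are dropped before the metadata is written:

```go
package main

import (
	"fmt"
	"strings"
)

// RelatedIdentifier is a hypothetical, simplified stand-in for the
// service's own metadata type.
type RelatedIdentifier struct {
	Identifier   string
	Type         string
	RelationType string
}

// filterEmpty returns only the entries that carry a non-blank identifier.
func filterEmpty(ids []RelatedIdentifier) []RelatedIdentifier {
	kept := make([]RelatedIdentifier, 0, len(ids))
	for _, id := range ids {
		if strings.TrimSpace(id.Identifier) != "" {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	// The identifier value below is a made-up placeholder.
	ids := []RelatedIdentifier{
		{Identifier: "example-identifier", Type: "DOI", RelationType: "IsSupplementTo"},
		{Identifier: "  ", Type: "DOI", RelationType: "IsSupplementTo"},
	}
	fmt.Println("entries kept:", len(filterEmpty(ids)))
}
```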

Enhancements to links between different versions of the same dataset

Following up from #57, we should also consider the case where a dataset has more than two versions. This hasn't come up yet, but there's a chance it will.

In the current system "middle" versions will both have a notice that a new version is available and a link to the previous one at the bottom. This will require the user to follow the links up the chain to the newest version.
It would be preferable to link to the newest version from all older pages and it would also be nice if the latest version listed all previous versions. If we want to do this, we'd have to make some more changes so that the landing page creation function follows the chain of references to link the latest one (if it's old) or list all available versions (if it's the newest).
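
Following the chain of references could look roughly like the sketch below. It assumes the "is new version of" relations have already been read into a map from each dataset identifier to the identifier of the version that directly supersedes it; the map, the function name, and the placeholder identifiers are all hypothetical.

```go
package main

import "fmt"

// latestVersion follows the chain of "superseded by" links starting at id
// and returns the newest version. newerOf maps an identifier to the
// identifier of the version that directly replaces it (hypothetical layout).
// A set of visited identifiers guards against accidental cycles.
func latestVersion(newerOf map[string]string, id string) string {
	seen := map[string]bool{id: true}
	for {
		next, ok := newerOf[id]
		if !ok || seen[next] {
			return id
		}
		seen[next] = true
		id = next
	}
}

func main() {
	// Placeholder identifiers, not real DOIs.
	newerOf := map[string]string{"v1": "v2", "v2": "v3"}
	fmt.Println(latestVersion(newerOf, "v1"))
}
```

With this, every older landing page can link straight to the result of `latestVersion`, and a page is the newest one when the function returns its own identifier.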

[util.go] Potentially unused code cleanup

The following functions do not seem to be used and could be removed if they are not required for legacy purposes or future development:

  • util.go:readBody() ... seems unused
  • util.go:makeUUID() ... seems unused; probably superseded by different DOI ID scheme.
  • util.go:ReferenceDescription() ... seems unused and superseded by util.go:FormatReferences()
  • util.go:ReferenceSource() ... seems unused and superseded by util.go:FormatReferences()
  • util.go:ReferenceID() ... seems unused and superseded by util.go:FormatReferences()

Annex content download fails silently leading to incomplete zip files

It has happened a couple of times that, when the download rate of annex content drops below byte/s, the content download stops silently and continues with the next file. This can leave files without content, which is a problem when the zip file is created: those files then contain only the annex object reference instead of the full content.

If another content download is initiated, the content download of these files will continue where the download has stopped before.

As a quick fix, a second round of content download should be added to the download procedure. If files have not been fully downloaded, they should finish during the second round. If all files had been downloaded in the first place, nothing will happen.

As a long-term solution, content download failures have to be identified; if content is still missing after the second round of downloads, the admins should be notified via email to look into the issue.
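
The two-round idea can be sketched as follows. The actual download is abstracted behind a `fetch` callback that reports whether a file's content is fully present afterwards; the callback, the function name, and the file names are assumptions for illustration and do not reflect how the service invokes git annex.

```go
package main

import "fmt"

// fetchAll runs up to `rounds` download passes over the files that are
// still incomplete and returns whatever is missing at the end.
// fetch is a hypothetical callback that tries to download one file and
// reports whether its content is fully present afterwards.
func fetchAll(files []string, fetch func(file string) bool, rounds int) []string {
	missing := files
	for i := 0; i < rounds && len(missing) > 0; i++ {
		var still []string
		for _, f := range missing {
			if !fetch(f) {
				still = append(still, f)
			}
		}
		missing = still
	}
	return missing
}

func main() {
	// Simulated download: "big.dat" only completes on the second attempt.
	attempts := map[string]int{}
	fetch := func(f string) bool {
		attempts[f]++
		return f != "big.dat" || attempts[f] >= 2
	}
	missing := fetchAll([]string{"a.txt", "big.dat"}, fetch, 2)
	if len(missing) > 0 {
		fmt.Println("notify admins, still missing:", missing)
	} else {
		fmt.Println("all content downloaded")
	}
}
```

If all files finish in the first pass, the second pass has nothing to do; anything left after the last pass is what the admin notification would report.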

Admin warning on ResourceType other than 'Dataset'

We usually want people to register their DOI as ResourceType 'Dataset' even though other ResourceTypes are valid. To not miss a ResourceType other than 'Dataset' add a corresponding admin warning after the DOI preparation is complete.

Try to validate license matching

Try to match the license string in the datacite.yml with the LICENSE file in the repository based on the standard description and text in the file (or the first line of the file). If it doesn't match, simply show a warning and ask the user to confirm that they actually do match.

Fails to create issue comment after issue is created

The issue creation added in #61 works for the creation of the issue, but adding a comment to report on problems after the initial issue is created doesn't work.

Error message: Failed to create issue or comment on XML repo: [500] {"message":"issue does not exist [id: 0, repo_id: 1338, index: 298]","url":"https://github.com/gogs/docs-api"}

Check storage before clone, get-content, and zip

Before each operation that potentially downloads enough data to fill the target storage, the service should check the available storage.

There are three steps where this would be useful:

  1. git clone (gin get)
  2. git annex get (gin get-content)
  3. zip

Step 1 might be an issue since it's not straightforward to know the size of the repository before cloning. We might have to solve that by adding the functionality to report repository sizes on GIN Web (Gogs).

Step 2 is pretty straightforward. For one, git annex will refuse to download a file if there is not enough local storage. That said, if several files are being downloaded, we might reach near-capacity before the limit is hit, so the service should check for the entire download size before running get-content. Git annex provides this information via git annex info (which also supports --json output).
When we add the repo size reporting functionality to GIN Web, this step could be merged into step 1.

Step 3 essentially means we would require twice the space required for step 2. Assuming no compression (worst case), the zip file would take up the same amount of storage as the repository, so the storage needed to clone & get-content will be doubled when creating the zip file.

Essentially, if we can know ahead of time the storage requirements for a repository, we can safely clone and zip if there is twice as much available space.

NOTE: This becomes tricky when multiple repositories are being registered simultaneously. We should consider having workers report the storage space they're planning to use to the other workers.
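
As a rough sketch of the rule above, assuming the repository size is already known and that workers can share how much space they have reserved (both are assumptions, as are all names below):

```go
package main

import "fmt"

// requiredSpace is the worst-case storage for one registration:
// the cloned repository with content plus an uncompressed zip of it.
func requiredSpace(repoSize uint64) uint64 {
	return 2 * repoSize
}

// canPrepare reports whether a job fits on the target storage.
// reserved is the space other workers have announced they plan to use
// (a hypothetical bookkeeping value).
func canPrepare(available, reserved, repoSize uint64) bool {
	if reserved > available {
		return false
	}
	return available-reserved >= requiredSpace(repoSize)
}

func main() {
	const GiB = 1 << 30
	fmt.Println(canPrepare(100*GiB, 0, 40*GiB))
	fmt.Println(canPrepare(100*GiB, 30*GiB, 40*GiB))
}
```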

Warn of failure to validate license title-content match

Followup to #55

The license validation function assumes the licenses match when it's unable to validate the text against the name. Instead, it should add a warning to the admin notification that a manual check is needed.

func checkLicenseMatch(expectedTextURL string, licenseText string) bool {
	expectedLicenseText, err := readFileAtURL(expectedTextURL)
	if err != nil {
		// License isn't known or there was a problem reading the file in the
		// repository.
		// Return positive response since we can't validate automatically.
		log.Printf("Can't validate License text. Unknown license name in datacite.yml: %q", expectedTextURL)
		return true
	}
	return string(expectedLicenseText) == licenseText
}
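
One possible shape for the change is a three-valued result instead of a bool, so that "could not validate" is distinguishable from "matches". The sketch below is not the service's code: the type, the constants, and the `known` parameter (standing in for a successful read of the reference text) are hypothetical, and the whitespace normalization is an extra assumption rather than part of this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// LicenseCheck is a hypothetical three-valued validation result.
type LicenseCheck int

const (
	LicenseMatch LicenseCheck = iota
	LicenseMismatch
	LicenseUnverified // reference text unavailable: needs a manual check
)

// normalize collapses all whitespace runs to single spaces so that
// formatting differences alone do not count as a mismatch (assumption).
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// checkLicense compares the repository license text with the reference
// text. known is false when the reference text could not be obtained.
func checkLicense(expected string, known bool, actual string) LicenseCheck {
	if !known {
		return LicenseUnverified
	}
	if normalize(expected) == normalize(actual) {
		return LicenseMatch
	}
	return LicenseMismatch
}

func main() {
	if checkLicense("", false, "some license text") == LicenseUnverified {
		fmt.Println("admin warning: license could not be validated, manual check needed")
	}
}
```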

Unify navbar rendering template

The top navbar for the landing pages and the upcoming keyword and front pages should be unified and embedded from the same template fragment.

Updates to existing requests

The service should keep a record of open requests and block the user from resubmitting a request from the same repository while a request is still open unless the source repository has changed. The record should be a database (or a flat file) with the following information:

  • Repository ID
  • Reserved/assigned DOI
  • Commit hash

When a new request comes in, the workflow would then be:

  • Check if the repository ID is already in the file.
    • If it's not, it's a new request and the request continues normally.
  • If the repository ID is in the file and the commit hash is the same, inform the user that the request is still pending and don't start a new registration.
  • If the commit hash is different, update the existing request with the new data and inform the user that their request has been updated without changing the reserved DOI.
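
The workflow above can be sketched with an in-memory map standing in for the database or flat file. All type and function names are hypothetical, and persistence and DOI reservation are left out:

```go
package main

import "fmt"

// Request is one record of an open registration request.
type Request struct {
	DOI    string // reserved/assigned DOI
	Commit string // commit hash at request time
}

// Action describes how an incoming request was handled.
type Action int

const (
	ActionNew     Action = iota // no open request: continue normally
	ActionPending               // same commit: request is still pending
	ActionUpdated               // new commit: existing request updated
)

// handleRequest applies the workflow to the record store, keyed by
// repository ID, and returns the action taken and the DOI in effect.
// newDOI is only used when the request is new.
func handleRequest(records map[int64]Request, repoID int64, commit, newDOI string) (Action, string) {
	rec, open := records[repoID]
	if !open {
		records[repoID] = Request{DOI: newDOI, Commit: commit}
		return ActionNew, newDOI
	}
	if rec.Commit == commit {
		return ActionPending, rec.DOI
	}
	rec.Commit = commit // keep the reserved DOI, record the new commit
	records[repoID] = rec
	return ActionUpdated, rec.DOI
}

func main() {
	records := map[int64]Request{}
	// Repository ID, commit hashes, and DOI strings are placeholders.
	fmt.Println(handleRequest(records, 42, "abc123", "doi-1"))
	fmt.Println(handleRequest(records, 42, "abc123", "doi-2"))
	fmt.Println(handleRequest(records, 42, "def456", "doi-3"))
}
```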

Datacite validation functionality consolidation

The datacite.yml content is validated in two different locations: partially in the readRepoYAML function and partially in the validateDataCiteValues function. Issues returned from both functions lead to a registration rejection.

Move the validation-specific code from readRepoYAML to validateDataCiteValues. readRepoYAML then returns an error only if the datacite.yml file is not a valid YAML file and the content cannot be unmarshalled. The validateDataCiteValues function will validate the datacite.yml content and collect and return all DataCite-specific issues.

Links between versions of the same dataset

When a new version of an existing dataset is linked, we add the information to the XML files of both the new and the old datasets using related identifiers and the reftype "IsNewVersionOf" and "IsOldVersionOf" respectively. The landing page generator should use this information to add:

  • A highly visible notice on the older version that a new version exists and a link to it.
  • A small notice on the new version that older versions of the dataset exist.

Allow all reftypes through the validator

All valid DataCite reftypes should be allowed through the validator. We should mark some that are less appropriate to warn us for manual review though.

We should add a warning for the admins to do a manual review if any reftype is not IsSupplementTo, which is the common way of referencing a manuscript.
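
A small sketch of the warning step, assuming the reftypes of a request are available as plain strings (the function name and message wording are hypothetical):

```go
package main

import "fmt"

// reftypeWarnings returns one admin warning for every reference whose
// reftype is not IsSupplementTo. No reftype is rejected here; deciding
// whether a reftype is valid at all is left to the validator.
func reftypeWarnings(reftypes []string) []string {
	var warnings []string
	for _, rt := range reftypes {
		if rt != "IsSupplementTo" {
			warnings = append(warnings,
				fmt.Sprintf("reference uses reftype %q: manual review needed", rt))
		}
	}
	return warnings
}

func main() {
	for _, w := range reftypeWarnings([]string{"IsSupplementTo", "IsNewVersionOf"}) {
		fmt.Println(w)
	}
}
```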

Add license title and link validation

It happens quite frequently that the license title and link in the datacite.yml do not match. This could be caught by adding a validation similar to the license "title" and "expected content" validation.

Collect DOI request failed messages

Requesting a DOI can fail due to a couple of issues:

  • the license file cannot be read.
  • the content of the datacite.yml is missing some required fields.
  • the datacite.yml file contains unsupported values.

Each of these cases leads to different error pages, potentially leading the user to submit a DOI request three times before all issues are resolved.

Change the procedure to always check all three of the above described conditions, collect all issues and show them to the user at once, before failing the DOI request.
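
One way to structure this is to model each condition as a function that returns its own list of messages and to run all of them before deciding whether the request fails. The sketch below uses stub checks with made-up messages; the real checks and their names are not taken from the code base.

```go
package main

import "fmt"

// collectIssues runs every check and gathers all messages instead of
// stopping at the first failing check.
func collectIssues(checks ...func() []string) []string {
	var issues []string
	for _, check := range checks {
		issues = append(issues, check()...)
	}
	return issues
}

func main() {
	// Stub checks standing in for the three conditions (hypothetical).
	licenseReadable := func() []string { return []string{"LICENSE file could not be read"} }
	requiredFields := func() []string { return []string{"title is missing", "authors are missing"} }
	supportedValues := func() []string { return nil }

	issues := collectIssues(licenseReadable, requiredFields, supportedValues)
	if len(issues) > 0 {
		fmt.Println("DOI request failed:")
		for _, msg := range issues {
			fmt.Println("  •", msg)
		}
	}
}
```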

Expose functionality through command line for admin tasks

The main functionality of preparing a repository for registration, as well as sub-parts of it, should be exposed via the command line. This would be useful for:

  • Regenerating landing pages (for template modifications).
  • Validating zip file consistency.
  • Updating metadata when user adds info (e.g., manuscript reference).

Keyword pages

Each keyword on a landing page links to a keyword page that should list all the published datasets that share that keyword.

  • Generate each keyword page, listing datasets in reverse chronological order.
  • Generate a page that lists all known keywords, sorted by keyword popularity.
  • Link to the keyword list page in the header.
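
The data behind these pages could be prepared along the lines of the sketch below. The `Dataset` type is a hypothetical stand-in, and the publication date is kept as an ISO 8601 string so that string order equals date order; ties in popularity are broken alphabetically, which is an assumption rather than a stated requirement.

```go
package main

import (
	"fmt"
	"sort"
)

// Dataset is a hypothetical, minimal view of a published dataset.
type Dataset struct {
	Title     string
	Published string // ISO 8601 date, e.g. "2021-06-01"
	Keywords  []string
}

// keywordIndex groups datasets by keyword, newest first within each group.
func keywordIndex(datasets []Dataset) map[string][]Dataset {
	index := make(map[string][]Dataset)
	for _, ds := range datasets {
		for _, kw := range ds.Keywords {
			index[kw] = append(index[kw], ds)
		}
	}
	for _, list := range index {
		sort.SliceStable(list, func(i, j int) bool {
			return list[i].Published > list[j].Published
		})
	}
	return index
}

// keywordsByPopularity lists keywords by dataset count, most used first.
func keywordsByPopularity(index map[string][]Dataset) []string {
	keywords := make([]string, 0, len(index))
	for kw := range index {
		keywords = append(keywords, kw)
	}
	sort.Slice(keywords, func(i, j int) bool {
		ni, nj := len(index[keywords[i]]), len(index[keywords[j]])
		if ni != nj {
			return ni > nj
		}
		return keywords[i] < keywords[j]
	})
	return keywords
}

func main() {
	index := keywordIndex([]Dataset{
		{Title: "A", Published: "2020-01-01", Keywords: []string{"ephys", "mouse"}},
		{Title: "B", Published: "2021-06-01", Keywords: []string{"ephys"}},
	})
	for _, kw := range keywordsByPopularity(index) {
		fmt.Println(kw, len(index[kw]))
	}
}
```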

Itemize invalid DOI request messages

When the datacite.yml file contains invalid entries or is missing required data, the corresponding messages are displayed to the user as a single string.

Display the messages as list items to make them more convenient to read.

Add ORCID to metadata even when it's missing the prefix

In the datacite.yml, we expect the author ID to be in the form type:id, e.g., ORCID:0000-0001-2345-6789. Sometimes a user will enter their ORCID but without the prefix. We should allow this and assume it's an ORCID when the prefix is missing, since that's the most commonly used author ID type. The service should warn us when this happens so we can manually verify the ID.
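
A minimal sketch of the fallback, with a hypothetical function name and return shape. The third return value signals that the type was assumed, which is where the admin warning would be raised:

```go
package main

import (
	"fmt"
	"strings"
)

// parseAuthorID splits an author ID of the form type:id. When no prefix
// is present, the ID is assumed to be an ORCID and assumed is set to true
// so the caller can warn the admins.
func parseAuthorID(raw string) (idType, id string, assumed bool) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return "", "", false
	}
	if t, v, found := strings.Cut(raw, ":"); found {
		return strings.TrimSpace(t), strings.TrimSpace(v), false
	}
	return "ORCID", raw, true
}

func main() {
	for _, raw := range []string{"ORCID:0000-0001-2345-6789", "0000-0001-2345-6789"} {
		idType, id, assumed := parseAuthorID(raw)
		fmt.Printf("type=%s id=%s assumed=%t\n", idType, id, assumed)
	}
}
```

Note that this sketch splits at the first colon, so an ID entered as a full URL would be split at the scheme; handling that case is left open here.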

Add commit hash to email and issue

When a new DOI request is being processed, add the HEAD commit hash to the gin issue and the admin email to verify and document at which stage the gin repository was submitted as a DOI request.

Very large datasets can cause prep process to hang

A very large dataset can cause parts of the preparation process to hang indefinitely. The most recent occurrence involved three very large datasets (hundreds of GiB each, the largest being over 800 GiB) being prepared at the same time. All three workers stopped during the git annex uninit phase. It seems the git-annex process hung for all three. Killing the processes let the worker queue continue, but all subsequent zip files (for requests that were already in the queue) did not contain any annexed data.

datacite: use different funder split character

Currently the information in funder is split at the comma character. Since this character has been used in funder name entries in the past, use a semicolon instead.

Keep the split on comma as a fallback for backwards compatibility.
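
The splitting rule could be sketched as follows. The sketch assumes the funder name comes before the split character and the award number after it, and it also treats an entry without any split character as a funder name, in line with the funderName default proposed in the issue further up. The `Funding` type, the function name, and the example entries are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Funding is a hypothetical, simplified funding entry.
type Funding struct {
	FunderName  string
	AwardNumber string
}

// parseFunding splits an entry at the first semicolon; if there is none,
// it falls back to the first comma. Without any split character the whole
// entry is used as the funder name.
func parseFunding(entry string) Funding {
	entry = strings.TrimSpace(entry)
	for _, sep := range []string{";", ","} {
		if name, award, found := strings.Cut(entry, sep); found {
			return Funding{
				FunderName:  strings.TrimSpace(name),
				AwardNumber: strings.TrimSpace(award),
			}
		}
	}
	return Funding{FunderName: entry}
}

func main() {
	// Illustrative entries only.
	for _, e := range []string{"Funder, Country; AB-123", "Funder, AB-123", "Funder"} {
		fmt.Printf("%+v\n", parseFunding(e))
	}
}
```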

Detect renamed repositories and broken links in landing pages

This functionality can be a small, constantly running worker (goroutine) in gin-doi that periodically checks the status of dataset landing pages. Broken links can occur for mainly two reasons:

  1. The original source repository was renamed.
  2. The original source repository was deleted.

In the first case, the badge on the repository and the link in the fork on GIN will be intact. The link to the source repository on the landing page should be updated. This should be done by updating the XML file and regenerating the landing page through the make-html command. The published metadata should also be updated.

In the second case, the link should be removed. The procedure should be the same: Update the XML file (remove the relatedIdentifier), regenerate the landing page, and update the published metadata.

A related situation (which hasn't occurred so far) is the owner of the repository making it private. This would also make the DOI fork private. In this case, it might be better to treat it like the second case, but also edit the DOI fork to make it a non-fork (or recreate it).

Use lowercase clone path for creating archive

The repository is cloned for preparation into a directory with the repository name in all lowercase:

go conf.GIN.Session.CloneRepo(strings.ToLower(URI), clonechan)

When preparing the archive, the commands are run without converting the path to lowercase:

repodir := filepath.Join(targetpath, reponame)

If a repository has uppercase characters in the name, this fails.
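
A possible fix is to derive the directory in one place and apply the same lowercase conversion that is used when cloning. The helper below is a hypothetical sketch, not existing code:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// cloneDir returns the directory a repository was cloned into, using the
// same lowercase conversion as the clone step.
func cloneDir(targetpath, reponame string) string {
	return filepath.Join(targetpath, strings.ToLower(reponame))
}

func main() {
	// Placeholder path and repository name.
	fmt.Println(cloneDir("prep", "MyDataset"))
}
```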

Block repeated requests to register the same dataset

After submitting a request and before the dataset is finalised, we should have a way to block repeated requests to register the same repository. As it is now, if a user reloads the final request confirmation page, they will trigger a second request as the page loads.

The service should keep a list of ongoing registration jobs and when the user attempts to register a repository while a request is still being processed, they should get an appropriate message.
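
The list of ongoing jobs could be as simple as a mutex-protected set, sketched below with hypothetical names. A request handler would call `Start` before queueing a job and show the "already being processed" message when it returns false; the worker calls `Finish` when the job is done.

```go
package main

import (
	"fmt"
	"sync"
)

// JobRegistry tracks repositories with a registration job in progress.
type JobRegistry struct {
	mu      sync.Mutex
	ongoing map[string]struct{}
}

// NewJobRegistry returns an empty registry.
func NewJobRegistry() *JobRegistry {
	return &JobRegistry{ongoing: make(map[string]struct{})}
}

// Start marks repo as in progress. It returns false if a job for the
// same repository is already running.
func (r *JobRegistry) Start(repo string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, busy := r.ongoing[repo]; busy {
		return false
	}
	r.ongoing[repo] = struct{}{}
	return true
}

// Finish removes repo from the set of ongoing jobs.
func (r *JobRegistry) Finish(repo string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.ongoing, repo)
}

func main() {
	jobs := NewJobRegistry()
	fmt.Println(jobs.Start("owner/repo")) // first request is accepted
	fmt.Println(jobs.Start("owner/repo")) // page reload is blocked
	jobs.Finish("owner/repo")
	fmt.Println(jobs.Start("owner/repo")) // accepted again after completion
}
```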

[util.go] 'FunderName' vs 'Funder': Potential name mixup

The util.go function FunderName is linked via the template.Func map, but seems to be unused.

There is a reference in template.go to funding.Funder which cannot be found in the code. Since funding.Funder appears right next to funding.AwardNumber and there is an AwardNumber function in util.go right next to the FunderName function, funding.Funder should probably be funding.FunderName in template.go.
