
cmip6-json-data-citation-generator's Issues

Granularities to support

@MartinaSt I still don't understand exactly what you mean by granularities. Do you mean we should be able to handle filenames of the form

* <mip_era>.<activity_id>.<institution_id>.<source_id>

and

* <mip_era>.<activity_id>.<institution_id>.<source_id>.<experiment_id>

?
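If that is the distinction, the granularity could be inferred from the number of dot-separated DRS components. A minimal sketch (the component counts and the labels `"source"`/`"experiment"` are assumptions for illustration, not part of the citation service's API):

```python
# Hypothetical helper: infer the granularity of a DRS-style identifier
# by counting its dot-separated components. The two forms handled here
# are the ones quoted above.
def drs_granularity(identifier):
    n_components = len(identifier.split("."))
    if n_components == 4:
        return "source"      # <mip_era>.<activity_id>.<institution_id>.<source_id>
    if n_components == 5:
        return "experiment"  # ...plus <experiment_id>
    raise ValueError("Unrecognised granularity: {}".format(identifier))


print(drs_granularity("CMIP6.VIACSAB.PCMDI.PCMDI-test-1-0"))  # source
```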

Discussion: Add netCDF reading

This would make it possible to generate the citation purely from the file, irrespective of its name or directory (and we could also rename files and put them in the right path too). However, it makes installation harder as you then need a netCDF reader installed in the right environment as well. I think it's probably the right move, but one for discussion now.
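To illustrate the idea: the global attributes of a CMIP6 file already carry the DRS information, so the citation could be built from file metadata alone. A minimal sketch — in practice a netCDF reader (e.g. the netCDF4 package's `Dataset.getncattr`) would supply the attributes; here a plain dict stands in so the logic is self-contained, with values taken from the example record in this repo:

```python
# Stand-in for the global attributes a netCDF reader would return;
# the attribute names follow the CMIP6 global-attribute conventions.
global_attrs = {
    "mip_era": "CMIP6",
    "activity_id": "VIACSAB",
    "institution_id": "PCMDI",
    "source_id": "PCMDI-test-1-0",
}

# Build the DRS subject string purely from file metadata, irrespective
# of the file's name or directory.
drs_id = ".".join(
    global_attrs[key]
    for key in ("mip_era", "activity_id", "institution_id", "source_id")
)
print(drs_id)  # CMIP6.VIACSAB.PCMDI.PCMDI-test-1-0
```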

Correct format of json file

@MartinaSt just want to check that I've correctly understood the format of the json we want to produce. Can you double check the format and my split of ignored, optional, compulsory and compulsory but fixed (i.e. fields that must be there but the content is always the same) fields?

# ignored
{
  "identifier": {
    "identifierType": "URL",
    "id": "http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.VIACSAB.PCMDI.PCMDI-test-1-0"
  },
  "publisher": "Earth System Grid Federation",
  "publicationYear": "2017",
  "dates": [
    {
      "dateType": "Created",
      "date": "2017-05-03"
    }
  ],
  "language": "en",
  "resourceType": {
    "resourceTypeGeneral": "Dataset",
    "resourceType": "Digital"
  },
  "formats": [
    {
      "format": "application/x-netcdf"
    }
  ],
  "rightsList": [
    {
      "rightsURI": "http://creativecommons.org/licenses/by-sa/4.0/",
      "rights": "Creative Commons Attribution 4.0 International License (CC BY-SA 4.0)"
    }
  ],
  "descriptions": [
    {
      "descriptionType": "Abstract",
      "text": "Coupled Model Intercomparison Project Phase 6 (CMIP6) data sets. These data have been generated as part of the internationally-coordinated Coupled Model Intercomparison Project Phase 6 (CMIP6; see also GMD Special Issue: http://www.geosci-model-dev.net/special_issue590.html). The simulation data provides a basis for climate research designed to answer fundamental science questions, and the results will undoubtedly be relied on by authors of the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC-AR6).CMIP6 is a project coordinated by the Working Group on Coupled Modelling (WGCM) as part of the World Climate Research Programme (WCRP). Phase 6 builds on previous phases executed under the leadership of the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and relies on the Earth System Grid Federation (ESGF) and the Centre for Environmental Data Analysis (CEDA) along with numerous related activities for implementation. The original data is hosted and partially replicated at a federated collection of data nodes, and most of the data relied on by the IPCC is being archived for long-term preservation at the IPCC Data Distribution Centre (IPCC DDC) hosted by World Data Centre for Climate (WDCC) at DKRZ.The project includes simulations from about 90 global climate models and around 40 institutions and organizations worldwide. - Project website: https://pcmdi.llnl.gov/CMIP6  The Earth System Model PCMDI-test 1.0 (This entry is free text for users to contribute verbose information), released in 1989, includes the components:  atmos: Earth1.0-gettingHotter (360 x 180 longitude/latitude; 50 levels; top level 0.1 mb), land: Earth1.0, ocean: BlueMarble1.0-warming (360 x 180 longitude/latitude; 50 levels; top grid cell 0-10 m), seaIce: Declining1.0-warming (360 x 180 longitude/latitude). 
The model was run by the Program for Climate Model Diagnosis and Intercomparison, Lawrence Livermore National Laboratory, Livermore, CA 94550, USA (PCMDI) in nominal resolutions: atmos: 1x1 degree, land: 1x1 degree, ocean: 1x1 degree, seaIce: 1x1 degree."
    }
  ],
# compulsory
  "creators": [
    {
      "creatorName": "Taylor, Karl E.",
      "givenName": "Karl E.",
      "familyName": "Taylor",
      "email": "[email protected]",
      "nameIdentifier": {
        "schemeURI": "http://orcid.org/",
        "nameIdentifierScheme": "ORCID",
        "pid": "0000-0002-6491-2135"
      },
      "affiliation": "Lawrence Livermore National Laboratory"
    }
  ],
  "titles": [
    "PCMDI PCMDI-test1.0 model output prepared for CMIP6 VIACSAB"
  ],
# compulsory but fixed
  "subjects": [
    {
      "subject": "CMIP6.VIACSAB.PCMDI.PCMDI-test-1-0",
      "schemeURI": "http://github.com/WCRP-CMIP/CMIP6_CVs",
      "subjectScheme": "DRS"
    },
    {
      "subject": "climate"
    },
    {
      "subject": "CMIP6"
    }
  ],
# optional
  "contributors": [
    {
      "contributorType": "ContactPerson",
      "contributorName": "Jungclaus, Johann",
      "givenName": "Johann",
      "familyName": "Jungclaus",
      "email": "[email protected]",
      "nameIdentifier": {
        "schemeURI": "http://orcid.org/",
        "nameIdentifierScheme": "ORCID",
        "pid": "0000-0002-3849-4339"
      },
      "affiliation": "Max-Planck-Institut fuer Meteorologie"
    },
    {
      "contributorType": "ResearchGroup",
      "contributorName": "Max-Planck-Institut fuer Meteorologie (MPI-M)"
    }
  ],
  "relatedIdentifiers": [
    {
      "relatedIdentifier": "10.5194/gmd-10-2247-2017",
      "relatedIdentifierType": "DOI",
      "relationType": "Cites"
    }
  ],
  "fundingReferences": [
    {
      "funderName": "Federal Ministry of Education and Research (BMBF)",
      "funderIdentifier": "http://doi.org/10.13039/501100002347",
      "funderIdentifierType": "Crossref Funder ID"
    }
  ]
}
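One way to make the split above actionable is a checker that fails fast when a compulsory field is absent. A stdlib-only sketch — the field lists mirror the annotations in the example and are an assumption about the final spec, to be confirmed:

```python
import json

# Field groups as annotated in the example above; treat this split as
# provisional until the format is confirmed.
COMPULSORY = {"creators", "titles", "subjects"}
IGNORED = {"identifier", "publisher", "publicationYear", "dates", "language",
           "resourceType", "formats", "rightsList", "descriptions"}


def check_citation_json(text):
    """Return the sorted list of compulsory fields missing from a citation json."""
    record = json.loads(text)
    return sorted(COMPULSORY - set(record))


print(check_citation_json('{"creators": [], "titles": []}'))  # ['subjects']
```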

Check special character support

Check the support for 'utf-8' in the json template and the json result.
The author names, or even their affiliations, might contain 'utf-8'
(non-English) characters. To give you a German example: one of our most
common last names is 'Müller'.
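A quick check of the round trip: Python's json module handles non-ASCII names either way, but `ensure_ascii=False` keeps them readable in the output file. A minimal sketch:

```python
# -*- coding: utf-8 -*-
import json

creator = {"familyName": u"Müller"}

# The default behaviour escapes non-ASCII characters...
print(json.dumps(creator))                      # {"familyName": "M\u00fcller"}
# ...while ensure_ascii=False writes them verbatim (the file must then
# be written with a utf-8 encoding).
print(json.dumps(creator, ensure_ascii=False))  # {"familyName": "Müller"}
```

Either form round-trips through `json.loads` to the same unicode string, so the choice only affects readability of the file on disk.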

Defining the workflow

Probably the right starting point:

  1. Go onto the GUI and make one entry for your files
    • this should mean that users enter all the contributors, institutes, references etc. they need
    • the tool should point people to the GUI before they do anything else and explain what steps they need to do
  2. Download the json for your file that you've entered
  3. Follow a guide (to be written) which explains how to modify your json so it can be used as a template to generate jsons for all of your other data files which you wish to cite
  4. Run the tool over the files you wish to generate jsons for
  5. Upload the jsons to the data citation server via the API
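Step 3 above could amount to replacing the file-specific values in the downloaded json with placeholders, which step 4 then fills back in per file. A hypothetical sketch — the `<...>` placeholder convention and the example field are assumptions for illustration, not something prescribed by the citation service:

```python
import json

# Hypothetical template derived from a downloaded json: file-specific
# values have been replaced with <...> placeholders by hand (step 3).
template = '{"subjects": [{"subject": "<drs_id>", "subjectScheme": "DRS"}]}'


def render(template, **values):
    """Fill the placeholders for one data file (step 4)."""
    for key, value in values.items():
        template = template.replace("<{}>".format(key), value)
    return json.loads(template)


print(render(template, drs_id="CMIP6.VIACSAB.PCMDI.PCMDI-test-1-0"))
```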

Installation error

Hi @znicholls
At last I have checked out the code. As I want to use the existing python2.7 on our HPC, I had to install missing packages in a local path:

python setup.py install --prefix=/home/dkrz/k204082/.local/

I got the following error:

...
Searching for pytest
Best match: pytest-cov 2.5.1
Downloading https://files.pythonhosted.org/packages/24/b4/7290d65b2f3633db51393bdf8ae66309b37620bc3ec116c5e357e3e37238/pytest-cov-2.5.1.tar.gz#sha256=03aa752cf11db41d281ea1d807d954c4eda35cfa1b21d6971966cc041bbf6e2d
Processing pytest-cov-2.5.1.tar.gz
Writing /tmp/easy_install-ZAjOEj/pytest-cov-2.5.1/setup.cfg
Running pytest-cov-2.5.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-ZAjOEj/pytest-cov-2.5.1/egg-dist-tmp-l4JpD2
warning: no files found matching '.isort.cfg'
warning: no files found matching '.pylintrc'
warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
warning: no previously-included files matching '__pycache__' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
removing '/mnt/lustre01/pf/k/k204082/.local/lib/python2.7/site-packages/pytest_cov-2.5.1-py2.7.egg' (and everything under it)
creating /mnt/lustre01/pf/k/k204082/.local/lib/python2.7/site-packages/pytest_cov-2.5.1-py2.7.egg
Extracting pytest_cov-2.5.1-py2.7.egg to /mnt/lustre01/pf/k/k204082/.local/lib/python2.7/site-packages
pytest-cov 2.5.1 is already the active version in easy-install.pth

Installed /mnt/lustre01/pf/k/k204082/.local/lib/python2.7/site-packages/pytest_cov-2.5.1-py2.7.egg
error: Could not find required distribution pytest

Restructuring

To change:

from marshmallow import Schema, fields


class AssertSchema(Schema):
    variable = fields.String(required=True)
    expected = fields.Float(required=True)
    threshold = fields.Float()
    # in marshmallow 2.x, `missing` (not `default`) supplies the value on load
    year = fields.Integer(missing=2100)
    region = fields.String(missing='GLOBAL')


class TestSchema(Schema):
    name = fields.String(required=True)
    start_year = fields.Integer(missing=1750)
    end_year = fields.Integer(missing=2100)
    asserts = fields.Nested(AssertSchema, many=True)


class CollectionSchema(Schema):
    base_config = fields.String(required=True)
    tests = fields.Nested(TestSchema, many=True)


# with strict=True, load raises on invalid input; it returns an
# UnmarshalResult namedtuple, so take .data for the validated dict
validated = CollectionSchema(strict=True).load(config).data

Instructions for uploading

Is your feature request related to a problem? Please describe.

We need some instructions for uploading, especially as this requires running with Python2.

Defining the task

@MartinaSt let's try again haha (I'm glad we worked this out now, better late than never right?)

First question: if all the input4MIPs data citations are done, we can forget about them. If not, then I think we need to look at how to map the information in the input4MIPs files (as specified here) into your citation tool.

Second question. What does the tool actually need to do? It should:

  • take a model output file, with filename as given by CMIP6 specs
  • take a template yaml file with information about who generated the data file, funders etc.
  • combine the two to generate a json file which fits with your data citation system
    • in this step it uses information from the model file to fill out the rest of the fields required to generate the json from the yaml
  • upload the json file to the data citation server via your API
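For the first bullet, a sketch of pulling the DRS components out of a CMIP6 filename. The underscore-separated component order follows the CMIP6 filename template; treat it, and the example filename, as assumptions to verify against the specs:

```python
import os

# Component order per the CMIP6 filename template (an assumption here):
# <variable_id>_<table_id>_<source_id>_<experiment_id>_<member_id>_<grid_label>[_<time_range>].nc
COMPONENTS = ("variable_id", "table_id", "source_id",
              "experiment_id", "member_id", "grid_label")


def parse_cmip6_filename(path):
    """Map a CMIP6-style filename onto its DRS components."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("_")
    info = dict(zip(COMPONENTS, parts))
    if len(parts) > len(COMPONENTS):      # the time range suffix is optional
        info["time_range"] = parts[len(COMPONENTS)]
    return info


info = parse_cmip6_filename(
    "tas_Amon_PCMDI-test-1-0_piControl_r1i1p1f1_gn_201501-210012.nc"
)
print(info["source_id"])  # PCMDI-test-1-0
```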

Third question. Does the tool need to look at the file metadata or should it all be contained in the filename? My current guess is it has to look at the model metadata.

Fourth question. Does the tool need to look at the directory structure or should it all be contained in the file metadata? My current understanding is that the directory structure contains only information that is also in the file metadata, hence looking at the directory structure is not necessary.

Fifth question. To aid users, should we have some sort of validation of the input yaml and json files which gives useful messages when there are errors?

Last (and most important) questions. If this gets built, who will use it? How much time will it save them, and when does it need to be ready? My impression is that all the groups submitting data to CMIP6 could use it. It would save each of them something on the order of hours, as they don't have to worry about formatting the json correctly, entering everything by hand via the GUI, or making sure they have citations for all their files. However, it won't save much more than that, as it doesn't do the addition of people/institutions via the GUI, and you still (obviously) have to manually enter who did what (i.e. which yaml template should be used for which files). As far as I can tell it needs to be ready by the end of August so that most groups (acknowledging some are already done) can use it to create data citations as their results come out.

Let's discuss this all in one big thread for now then I will split into smaller issues as appropriate.

Testing upload

@MartinaSt Do you want to include uploading in this repository? If yes, is there a way we can test that the upload works? The options I'm thinking of are:

  • actually do an upload with a fake name, that you can easily find and delete/ignore?
  • test that the json is the right format and test that the upload would be done, but don't actually do it?
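The second option can be done with a stub in the test suite: validate the json, then assert the upload call would have been made without letting anything reach the server. A stdlib sketch using `unittest.mock` — the `upload` function and its endpoint are hypothetical stand-ins for the real API call:

```python
import json
from unittest import mock  # the `mock` backport provides this on Python 2


# Hypothetical upload function standing in for the real API call.
def upload(record, url="https://example.invalid/api/upload"):
    raise RuntimeError("should never be called for real in tests")


def submit(record_text, uploader=upload):
    record = json.loads(record_text)  # fails loudly if the json is malformed
    return uploader(record)


# Stub-style test: check the payload without touching the server.
fake_upload = mock.Mock(return_value="ok")
result = submit('{"titles": ["test entry"]}', uploader=fake_upload)
assert result == "ok"
fake_upload.assert_called_once_with({"titles": ["test entry"]})
```

This covers "the upload would be done" without the cleanup burden of the first option; a single opt-in integration test against a fake-named record could complement it.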
