
clef's People

Contributors

paolap, scottwales


clef's Issues

Local Search

Currently the search works by calling out to ESGF then matching search results against file checksums. Unfortunately the ESGF catalogue is not comprehensive - old versions of datasets get removed

Add a way to search the local paths via MAS to retrieve datasets that have been removed from ESGF

Challenges:

Version information is part of a dataset, not a file (one file can be a member of multiple dataset versions), so this must be retrieved from the DRS. The DRS for ua6 is unusable, but now that Synda is being used in al33 and oi10 that makes a local search possible.

The DRS entries are made up of both files and symlinks. It is not possible for the database to resolve symbolic links (this has to be done as part of the filesystem crawling) so for the moment we cannot link a DRS entry against the file metadata.

Implementation plan:

  • Add a new view drs_path as a generic DRS entry, with columns path_id, project. The project column will be used for SQLAlchemy joined table inheritance (see the sketch after this list)
  • Add views for drs_path_cmip5, drs_path_cmip6, drs_path_cordex etc. as needed with each catalogue's components generated from the matching path
  • Add a python function that runs a search on these views given a set of constraints
  • Add a CLI option to search the local DRS entries instead of ESGF
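
A minimal sketch of how the joined-table inheritance mapping could look, assuming the drs_path and drs_path_cmip5 views described above; the facet columns on the CMIP5 class are placeholders, not a final design.

from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class DRSPath(Base):
    __tablename__ = 'drs_path'
    path_id = Column(Integer, primary_key=True)
    project = Column(String)
    # project discriminates the subclasses (joined table inheritance)
    __mapper_args__ = {'polymorphic_on': project, 'polymorphic_identity': 'generic'}

class DRSPathCMIP5(DRSPath):
    __tablename__ = 'drs_path_cmip5'
    path_id = Column(Integer, ForeignKey('drs_path.path_id'), primary_key=True)
    model = Column(String)        # placeholder facet columns generated from the path
    experiment = Column(String)
    __mapper_args__ = {'polymorphic_identity': 'CMIP5'}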

raise search limit

clef cmip5 --variable tos --experiment historical returns

Too many results 1219, try limiting your search:
https://esgf-data.dkrz.de/search/cmip5-dkrz?query=&distrib=True&replica=False&latest=True&project=CMIP5&experiment=historical&cmor_table=Omon&variable=tos

This is a fairly common search; if the data were daily we'd have even more files. Unless it causes other kinds of issues, I propose raising the limit substantially. I'll run some tests to find a reasonable number.

set permissions for new monthly log files

The ACLs don't work when a new monthly log file is created, so I need to handle this within clef.
It would also be good to find a better location for the files, consistent with the collection location.
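
A minimal sketch of how clef could set the permissions itself when a new monthly log file is created; the mode below is an assumption about the desired policy, not a decided value.

import os
import stat

def open_monthly_log(path):
    # Open the monthly log, making a freshly created file group-writable
    new_file = not os.path.exists(path)
    log = open(path, 'a')
    if new_file:
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IWGRP | stat.S_IROTH)
    return log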

esgf local is omitting results

Search:

$ esgf local --experiment rcp85 --ensemble r1i1p1 --time_frequency day --debug --model CNRM% --variable rlds --all-versions
/rlds_day_CNRM-CM5_rcp85_r1i1p1_20160101-20201231.nc
/rlds_day_CNRM-CM5_rcp85_r1i1p1_20260101-20301231.nc
/rlds_day_CNRM-CM5_rcp85_r1i1p1_20310101-20351231.nc
/rlds_day_CNRM-CM5_rcp85_r1i1p1_20910101-20951231.nc
/rlds_day_CNRM-CM5_rcp85_r1i1p1_20960101-21001231.nc

Full list in the directory is

rlds_day_CNRM-CM5_rcp85_r1i1p1_20060101-20101231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20310101-20351231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20510101-20551231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20710101-20751231.nc
rlds_day_CNRM-CM5_rcp85_r1i1p1_20110101-20151231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20360101-20401231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20560101-20601231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20860101-20901231.nc
rlds_day_CNRM-CM5_rcp85_r1i1p1_20160101-20201231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20410101-20451231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20610101-20651231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20910101-20951231.nc
rlds_day_CNRM-CM5_rcp85_r1i1p1_20260101-20301231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20460101-20501231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20660101-20701231.nc  rlds_day_CNRM-CM5_rcp85_r1i1p1_20960101-21001231.nc

So we're missing lots of matches

Make esgf search more modular

The ESGF search code is a bit of a mess; make it simpler to use so the results can be matched against the MAS database as needed.

CMIP6 activity_id can be a list of terms

Just noticed this in the documentation:
an activity_id identifying the responsible “MIP”. In a few cases it may be appropriate to include multiple activities in the activity_id (separated by single spaces). An example of this is “LUMIP AerChemMIP” for one of the CMIP6 land-use change experiments.

It's not urgent since currently they are only publishing 'CMIP' (DECK) output, but we'll eventually need to treat activity_id as a list!
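
For reference, a trivial sketch of the handling that will be needed, splitting the attribute on spaces as the documentation describes:

activity_id = "LUMIP AerChemMIP"
activities = activity_id.split()   # ['LUMIP', 'AerChemMIP']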

creating/adapting arguments to CMIP6 DRS

The first CMIP6 data is out; the DRS, facets and global attributes in the files are partially different from CMIP5.
I've created a branch called "handling-args".
step 1) I've created a mapping using json files, then imported all the available vocabularies for CMIP6.
step 2) For simplicity I've duplicated esgf.py and model.py; the plan is to adapt them to CMIP6 so we can see the similarities and differences more clearly, and then decide whether to keep this as the final approach or to have a CMIP5/6 "switch" with a single code path for both. NB: the server is also different!
step 3) Use the vocabularies to check the validity of the arguments passed by users.

The data is not yet in MAS, but I'm still adding modified views for CMIP6 in db/tables.sql.
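
A rough sketch of step 3, assuming a hypothetical cmip6_vocab.json file that maps each facet name to its list of valid values (the real file layout in the handling-args branch may differ):

import json

def validate_facet(facet, value, vocab_file='cmip6_vocab.json'):
    # Check a user-supplied value against the imported CMIP6 vocabularies
    with open(vocab_file) as f:
        vocab = json.load(f)
    valid = vocab.get(facet, [])
    if valid and value not in valid:
        raise ValueError("'{}' is not a valid {}; valid values are {}".format(value, facet, valid))
    return value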

issue with convert period function when nanosecond in timestamp

File "/home/581/pxp581/.local/lib/python3.6/site-packages/clef/code.py", line 255, in convert_periods
freq=freq[frequency]))
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-18.04/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py", line 2749, in date_range
closed=closed, **kwargs)
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-18.04/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py", line 381, in __new__
ambiguous=ambiguous)
File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-18.04/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py", line 476, in _generate
start = Timestamp(start)
File "pandas/_libs/tslibs/timestamps.pyx", line 635, in pandas._libs.tslibs.timestamps.Timestamp.__new__
File "pandas/_libs/tslibs/conversion.pyx", line 275, in pandas._libs.tslibs.conversion.convert_to_tsobject
File "pandas/_libs/tslibs/conversion.pyx", line 484, in pandas._libs.tslibs.conversion.convert_str_to_tsobject
File "pandas/_libs/tslibs/conversion.pyx", line 453, in pandas._libs.tslibs.conversion.convert_str_to_tsobject
File "pandas/_libs/tslibs/np_datetime.pyx", line 121, in pandas._libs.tslibs.np_datetime.check_dts_bounds
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2270-01-01 00:00:00

The file causing this is one of the MIROC5, rcp26, Amon files.
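
A possible workaround sketch (not the current convert_periods code): pandas Timestamps only cover roughly 1677-2262 at nanosecond resolution, so a date like 2270-01-01 overflows, while PeriodIndex has no such limit, so the conversion could fall back to period_range.

import pandas as pd
from pandas.errors import OutOfBoundsDatetime

def safe_range(start, end, freq='M'):
    # Fall back to Periods when the dates exceed the Timestamp bounds
    try:
        return pd.date_range(start, end, freq=freq)
    except OutOfBoundsDatetime:
        return pd.period_range(start, end, freq=freq)

print(safe_range('2270-01-01', '2300-12-31'))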

Add local database table for ESGF facets

Download the list of valid facets from ESGF to a local table.

This would allow facets to be validated for all projects, including CORDEX etc.

It would also allow use of SQL wildcards to select multiple models, e.g. model='ACCESS%'
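
A sketch of how the facet values could be fetched, assuming the esg-search facet counts work as usual (limit=0 returns no records, only the facet listing); the node URL below is the DKRZ one already used elsewhere in clef.

import requests

def esgf_facet_values(facet, project='CMIP5',
                      node='https://esgf-data.dkrz.de/esg-search/search'):
    r = requests.get(node, params={
        'project': project,
        'facets': facet,
        'limit': 0,
        'format': 'application/solr+json'})
    r.raise_for_status()
    # Solr returns [value1, count1, value2, count2, ...]; keep only the values
    counts = r.json()['facet_counts']['facet_fields'][facet]
    return counts[::2]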

Requires:

Incorrect `esgf local` results

I'm getting different results for this example:

esgf local  --experiment historical --variable tas --ensemble r1i1p1 --model % --user pxp581 > out-search.txt

which returns 99 files. I've used model % to mean any model; if I don't specify the --model flag at all, I get no results.

The 99 files are for the models MPI-ESM-MR, MPI-ESM-LR, MPI-ESM-P, NorESM1-ME, NorESM1-M and CCMC-CM, for a total of 6 models (in some cases with more than one frequency, i.e. different ensembles per model).

The equivalent search with search_replica returns 108 ensembles

datasets appearing both in local and as missing

Just found out this:

~/.local/bin/clef cmip5 --variable tas --experiment historical --table day --ensemble r2i1p1 --model MIROC5
/g/data1b/al33/replicas/CMIP5/combined/MIROC/MIROC5/historical/day/atmos/day/r2i1p1/v20120710/tas/
('cmip5.output1.MIROC.MIROC5.historical.day.atmos.day.r2i1p1.v20120710',)
The following datasets are not yet available in the database, but they have been requested or recently downloaded
cmip5.output1.MIROC.MIROC5.historical.day.atmos.day.r2i1p1.v20120710. status: 'done'

The MIROC5 data has already been downloaded and is also in the database, yet the tool still considers it missing. I reckon this happens because the directory path, which contains "combined", doesn't match the dataset_id, which contains "output1".

I've printed out the dataset_id for the missing files; it then gets recognised when searching the queue status online. It shouldn't even get there, since that check only covers dataset_ids which are "missing".

I imagine this would happen if we built the dataset_id from the path somewhere, but I can't find where this happens in the code.
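
One possible way to compare a dataset_id against a local DRS path while ignoring the product component ('combined' on disk vs 'output1'/'output2' in the id); this is only an illustration, not what the code currently does.

def same_dataset(dataset_id, path):
    # Skip the project and product terms of the id, then check the remaining
    # facets (institute, model, ..., version) all appear in the path
    id_facets = dataset_id.split('.')[2:]
    path_parts = path.split('/')
    return all(facet in path_parts for facet in id_facets)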

keeping free text query?

I'm running some tests on the free text query and each time I get many more results than by specifying the exact constraints for the values. I'm wondering whether it's worth keeping this functionality, or if it's just likely to confuse users and produce many more results than if they were obliged to define the arguments. Another issue is that we can't validate what they pass against the vocabularies.

Dots and dashes in model names

This is a very minor comment, but I've noticed that at the command line clef works fine with either dashes or dots between the numbers in model names, e.g.
CSIRO-Mk3.6.0
CSIRO-Mk3-6-0

whereas when using import clef.code within a script it only accepts dots.

It probably wouldn't really matter except that the official model ids use dashes, so your first instinct as a user is to use CSIRO-Mk3-6-0.
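
A sketch of one way clef.code could accept both spellings, assuming the database stores the dotted form (e.g. 'CSIRO-Mk3.6.0'):

import re

def normalise_model(name):
    # Turn dashes between digits into dots: CSIRO-Mk3-6-0 -> CSIRO-Mk3.6.0
    return re.sub(r'(?<=\d)-(?=\d)', '.', name)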

Issue in ESGF response for latest==False

Running this search in ESGF:

https://esgf-data.dkrz.de/esg-search/search?fields=checksum%2Cid%2Cdataset_id%2Ctitle%2Cversion&offset=0&limit=1000&distrib=True&replica=False&latest=False&type=File&format=application%2Fsolr%2Bjson&project=CMIP5&experiment=historical&model=ACCESS1.0&cmor_table=Amon&variable=tasmin

corresponding to

clef cmip5 --variable tasmin --experiment historical --table Amon -m ACCESS1.0 --all-versions

produces results as if the variable were not specified; running with latest=True is fine.

If you look at the json response for both, with latest=True you get only variable=['tasmin'], while with latest=False you get "variable":["cl", "cli", "clivi", "clt", "clw", ... (all the variables in the same dataset).

I will try to filter this using the 'id' field, which should contain '.<variable>_' in it, as for example:
"id":"cmip5.output1.CSIRO-BOM.ACCESS1-0.historical.mon.atmos.Amon.r1i1p1.v20120727.tasmin_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc_0|esgf.nci.org.au",

ACCESS1.0 files not being returned

The following search is returning no output

(clef-test) saw562@vdi-n4 ~ $ clef cmip5 --variable tasmin --experiment historical --table Amon -m ACCESS1.0

Everything available on ESGF is also available locally
(clef-test) saw562@vdi-n4 ~ $ 

CMIP6 ESGF search doesn't include variable

Just discovered this:

WHERE extended_metadata.variable ILIKE ANY (%(param_1)s)
INFO:sqlalchemy.engine.base.Engine:{'param_1': ['tntmp']}
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): esgf-node.llnl.gov
DEBUG:urllib3.connectionpool:https://esgf-node.llnl.gov:443 "GET /esg-search/search?fields=checksum%2Cid%2Cdataset_id%2Ctitle%2Cversion&offset=0&limit=1000&distrib=True&replica=False&type=File&format=application%2Fsolr%2Bjson&activity_id=CMIP&experiment_id=amip&table_id=CFmon&project=CMIP6 HTTP/1.1" 200 3971

While I passed the value tntmp for variable, it is not included in the ESGF search, so I get the dataset-ids for all the missing variables as a result.

Simply eliminating the "terms" selection based on local values and setting terms['variable_id'] = variable_id fixes the issue. Since we have validation, we should eliminate this step, which risks excluding anything that is not yet available locally, which is very likely for CMIP6.

renaming esgf_dataset to cmip5_dataset?

Depending on whether we keep the two datasets' arguments and view definitions separate, it might make more sense to rename this class and view, since it looks like there isn't a unified esgf_dataset. Alternatively it could be esgf2_dataset (as in the new version) instead of cmip6_dataset.

queue status for CMIP5 doesn't discriminate for variables

I've run the following search today:

clef cmip5 -e rcp85 -t day -m MPI-ESM-LR -v rlds

The following datasets are not yet available in the database, but they have been requested or recently downloaded
cmip5.output1.MPI-M.MPI-ESM-LR.rcp85.day.atmos.day.r1i1p1.v20111014 status: 'done'
cmip5.output1.MPI-M.MPI-ESM-LR.rcp85.day.atmos.day.r2i1p1.v20111014 status: 'done'
cmip5.output1.MPI-M.MPI-ESM-LR.rcp85.day.atmos.day.r3i1p1.v20111014 status: 'done'

When looking for the variables in al33 I couldn't find either of them.
I think this happens because I've implemented a function to match missing dataset-ids to the dataset-ids stored in the online table. For CMIP6 this is fine because each variable has its own dataset-id; for CMIP5 a group of variables shares the same dataset-id.
I should add a check on the variable; however, this might not be possible, as I believe the variables are not always listed.

review click common_args

We should make sure we review the query arguments. Here are some thoughts:

  • I think product and variable_long_name could be deleted.
  • time_frequency should maybe be shortened to frequency.
  • experiment_family might not make sense anymore, while source_type (the type of model, i.e. AOGCM, AMIP) might be much more useful.
  • There are quite a few args that exist only for CMIP6; maybe we should print a warning when they are used with CMIP5 and then ignore them in the search.
  • For facets like ensemble which have changed with CMIP6, maybe we should use the CMIP6 name and map it to the CMIP5 one when the user queries CMIP5, so we don't confuse users with two sets of arguments.
  • --debug, --no-distrib and --replica could maybe all become flags, with defaults of no-debug, no-replica and distrib (see the sketch below).
  • Change "format" to something else (maybe out_format?) to avoid conflicts and to clarify its meaning.

member_id in C6Dataset is wrong

Working on the local search I discovered that member_id in the CMIP6 table is wrong.
We build this value by joining

sub_experiment_id + '-' + variant_label

In most cases sub_experiment_id is none and member_id should simply equal the variant_label.
I found an example for the IPSL-CM6A-LR model where sub_experiment_id returned 'none' as a string, so the member_id ended up as "none-r1i1p1f1".
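
For reference, a sketch of the convention as I understand it: member_id should equal variant_label when there is no real sub-experiment, and only otherwise get the prefix.

def build_member_id(sub_experiment_id, variant_label):
    # Skip the prefix when there is no real sub-experiment
    if sub_experiment_id in (None, '', 'none'):
        return variant_label
    return '{}-{}'.format(sub_experiment_id, variant_label)

assert build_member_id('none', 'r1i1p1f1') == 'r1i1p1f1'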

CSIRO MK models missing from MAS??

We still get datasets from the models .UNSW.CSIRO-Mk3L-1-2 and CSIRO-QCCCE.CSIRO-Mk3-6-0 listed as not available on raijin.
Need to check whether they're in MAS and, if that's the case, why they aren't returned.

Use MAS dns name

Sean has given MAS a DNS name, clef.nci.org.au. We should use this for database connections instead of the raw IP address.
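
A sketch of what the connection would look like with the DNS name; the driver, user and database name below are placeholders, not the real MAS settings.

from sqlalchemy import create_engine

engine = create_engine('postgresql://username@clef.nci.org.au/database')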

Should `esgf missing` return datasets instead of files?

Currently esgf missing returns file ids. This could be useful for spotting single files missing from a dataset, but in most cases the whole dataset will be downloaded. Having the search be on files may also produce an overwhelming number of matches

issue with filtering values

I think this will eventually solve itself once we have validation; currently, if no value is passed for a constraint, or if that value is not available locally, we behave in the same way.
This means that if I pass a misspelled model name, for example, I will get as a result all the available models matching the other constraints, rather than a warning that my model doesn't exist.
I think this is the source of the issue:

for key, value in six.iteritems(dataset_constraints):
    if len(value) > 0:
        filters.append(getattr(C6Dataset, key).ilike(any_([x for x in value])))

        # If this key was filtered get a list of the matching values, used
        # in the ESGF query
        terms[key] = [x[0] for x in (s.query(getattr(C6Dataset, key))
            .distinct()
            .filter(getattr(C6Dataset, key).ilike(any_([x for x in value]))))]

I suppose this worked fine when we were running local and missing separately. It must have been something to do with changing the workflow
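
A possible fix sketch: after building terms, warn when a user-supplied constraint matched nothing locally, instead of silently behaving as if it had not been passed (the names below are illustrative).

import warnings

def check_constraints(terms, dataset_constraints):
    # terms: facet -> locally matching values; dataset_constraints: facet -> requested values
    for key, requested in dataset_constraints.items():
        if requested and not terms.get(key):
            warnings.warn("No local matches for {}={}; check the spelling".format(key, list(requested)))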

`esgf search`: Values object is not iterable

$ esgf search --model=ACCESS1.0 --experiment historical --mip Amon
Traceback (most recent call last):
  File "/short/w35/saw562/conda/envs/arccssive2-dev/bin/esgf", line 10, in <module>
    sys.exit(esgf())
  File "/short/w35/saw562/conda/envs/arccssive2-dev/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/short/w35/saw562/conda/envs/arccssive2-dev/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/short/w35/saw562/conda/envs/arccssive2-dev/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/short/w35/saw562/conda/envs/arccssive2-dev/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/short/w35/saw562/conda/envs/arccssive2-dev/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/562/saw562/arccssive2/arccssive2/cli.py", line 112, in search
    for result in q:
TypeError: 'values' object is not iterable

Error in mk3 results

The directory returned by this search does not exist

$ clef --local cmip5 -m CSIRO-Mk3-6-0 -e historical -t Omon --ensemble r1i1p1 -v tos --latest
/g/data1/rr3/publications/CMIP5/output1/CSIRO-QCCCE/CSIRO-Mk3-6-0/historical/mon/ocean/Omon/r1i1p1/latest/tos

combining local and missing

I showed the tool this morning at the meeting and there was a strong preference for having fewer subcommands. We also agreed that a user would probably want to know both what's local and what's missing in one go, so they suggested making that the default and then giving the option of choosing only one or the other.

If we do so, then local and missing will become "flags" rather than subcommands, and we will automatically have fewer subcommands left.

I will try to implement this first and then see if we can take it a step further and have "nested" commands based on passing a value to the "project" argument. In this way, if we add more datasets, like CORDEX, that potentially need different arguments, we can adapt the tool more easily.

adding cordex search

Adding the option of downloading CORDEX data.
This is not available on all nodes; DKRZ, which is the one we are currently using, has CORDEX-specific facets.
here's an example of a url:

https://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=10&type=Dataset&replica=false&latest=true&product=output&rcm_version=v1&domain=AFR-44&experiment_family=Historical&project=CORDEX&time_frequency=mon&experiment=historical&driving_model=CCCma-CanESM2&variable=clt&ensemble=r1i1p1&facets=rcm_name%2Cproject%2Cproduct%2Cdomain%2Cinstitute%2Cdriving_model%2Cexperiment%2Cexperiment_family%2Censemble%2Crcm_version%2Ctime_frequency%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson

The arguments seem to match CMIP5 better; this might be different for CORDEX-CMIP6, where searching CORDEX should simply be a CMIP6 activity. These are the CORDEX-specific attributes:

  • driving_model
  • rcm_name
  • rcm_version
  • domain

The facets "product" and "output" are normally not needed for CMIP5 searches. CORDEX has 3 values for project (CORDEX, CORDEX-Adjust and CORDEX-Reklies), and for output, rather than a generic output1/2, the values are "output" and "bias-adjusted-output". So potentially users might want to pass these as well.

Probably the easiest thing to do is to let users pass whatever arguments they like, perhaps checking that each of these facets actually exists and printing a warning if one doesn't.
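
For reference, the same query can be sketched with requests against the esg-search endpoint; the facet names come straight from the URL above.

import requests

r = requests.get('https://esgf-data.dkrz.de/esg-search/search/', params={
    'project': 'CORDEX',
    'domain': 'AFR-44',
    'experiment': 'historical',
    'driving_model': 'CCCma-CanESM2',
    'rcm_version': 'v1',
    'time_frequency': 'mon',
    'variable': 'clt',
    'ensemble': 'r1i1p1',
    'type': 'Dataset',
    'replica': 'false',
    'latest': 'true',
    'format': 'application/solr+json'})
r.raise_for_status()
print(r.json()['response']['numFound'])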

Enhancement: the % wildcard doesn't work anymore

This is because we're using click's Choice type to validate the arguments passed by the user. I suppose we could have our own subclass of click.Choice that accepts a string containing %.
Other feedback we received was that it would be nice to be able to pass a list of arguments, i.e.
--variable tasmin tasmax
rather than
--variable tasmin --variable tasmax

Again, I think we can fix this by creating our own class; none of this is terribly urgent.
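
A rough sketch of such a class: a click.Choice subclass that lets values containing % bypass the fixed-choice validation so SQL wildcards keep working.

import click

class WildcardChoice(click.Choice):
    def convert(self, value, param, ctx):
        # Pass wildcard patterns straight through, validate everything else
        if '%' in value:
            return value
        return super(WildcardChoice, self).convert(value, param, ctx)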

clef cmip6 produces error while searching CMIP6 queued table

clef cmip6 -e 1pctCO2 --activity CMIP -t Amon -v tas
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/clef-test/bin/clef", line 11, in <module>
    sys.exit(clef())
  File "/g/data3/hh5/public/apps/miniconda3/envs/clef-test/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/g/data3/hh5/public/apps/miniconda3/envs/clef-test/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/g/data3/hh5/public/apps/miniconda3/envs/clef-test/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/g/data3/hh5/public/apps/miniconda3/envs/clef-test/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/g/data3/hh5/public/apps/miniconda3/envs/clef-test/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/g/data3/hh5/public/apps/miniconda3/envs/clef-test/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/g/data3/hh5/public/apps/miniconda3/envs/clef-test/lib/python3.7/site-packages/clef/cli.py", line 432, in cmip6
    updated = search_queue_csv(qm, project, [])
  File "/g/data3/hh5/public/apps/miniconda3/envs/clef-test/lib/python3.7/site-packages/clef/download.py", line 83, in search_queue_csv
    rows[row[1]] = [row[0],row[2]]
IndexError: list index out of range

This error is caused by the fact that CMIP6_clef_table.csv has one less field than the corresponding CMIP5 table.
It's fixed in the latest "collections" commit.

No matching ESGF results from `esgf missing`

esgf missing --frequency day  --experiment historical --variable tas --ensemble r1i1p1 --model % --user pxp581 > out-search.txt

is producing an error:

raise Exception('No matches found on ESGF')
Exception: No matches found on ESGF

Other errors as well:

  • removing --model produces a list of cordex experiments
  • using --model % produces error above
  • using --model MIROC% works but doesn't return anything
  • using --model CNRM-CM5 same as above (although we should find something since it's not found by local)

R integration

There's a group of R users that would like to integrate the new MAS search in their scripts.
A possibility is to return the results as json and then they can embed the command in:
library(jsonlite)

clef_raw_query <- function(...) {
  # Capture clef's JSON output (intern=TRUE returns stdout rather than the exit status)
  json <- system(paste("clef --json", paste(list(...), collapse = " ")), intern = TRUE)
  fromJSON(paste(json, collapse = ""))
}

Query NCI ESGF node

NCI have improved their ESGF node; it now has a local cache for distributed searches and replicates all nodes. We should switch to using the NCI node.

clef is not returning any data for both CMIP5 and CMIP6

Appears to be due to a change in MAS. The selector to find CMIP6 data files returns no matches

postgres=> select count(*) from oi10.paths where pa_type in ('file', 'link') and pa_parents[4] = md5('/g/data1b/oi10/replicas')::uuid;
 count
-------
     0

The pa_parents[4] check is supposed to find paths starting with that prefix

The files are actually present if we do an explicit search

postgres=> select count(*) from oi10.paths where pa_type in ('file', 'link') and pa_path like '/g/data1b/oi10/replicas/%';
 count
-------
  4235

Web search URL is incorrect

The reported web search URL when no matches are found on ESGF is incorrect

$ clef cmip5 --model BCC-CSM1.1 --variable tas --experiment historical --table day
...
clef.esgf.ESGFException: No matches found on ESGF, check at https://esgf-data.dkrz.de/esg-search/search/cmip5?query=&distrib=True&replica=False&latest=True&project=CMIP5&experiment=historical&model=BCC-CSM1.1&cmor_table=day&variable=tas

This URL should be https://esgf-data.dkrz.de/search/cmip5-dkrz?query=&distrib=True&replica=False&latest=True&project=CMIP5&experiment=historical&model=BCC-CSM1.1&cmor_table=day&variable=tas

rr3/publications results are not filtered based on project

As an example:
clef --request cmip5 -e 1pctCO2 -t Amon -v tas
has among the many results:

/g/data1/rr3/publications/CMIP5/output1/UNSW/CSIRO-Mk3L-1-2/1pctCO2/mon/atmos/Amon/r3i1p1/v20170728/tas/
/g/data1/rr3/publications/PMIP3/output/UNSW/CSIRO-Mk3L-1-2/1pctCO2/mon/atmos/Amon/r1i1p1/v20170728/tas/

Somewhere we must be assuming that anything in rr3 is CMIP5, while only /g/data/rr3/publications/CMIP5 should be here!

search returning wrong local directory

When searching CMIP5, MAS returns results for the "output" directories rather than "combined", which is the symlink.
The local search instead returns both; this could be solved by filtering the results so only "combined" is shown.
The first issue might be a duplicate of an older issue, but I couldn't locate it.
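
A sketch of the suggested filtering for the local search, keeping only the "combined" symlink paths:

def combined_only(paths):
    return [p for p in paths if '/combined/' in p]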

Enhancement: cmip6 dataset-errata integration

Errata will be managed programmatically in CMIP6 using ES-DOC

https://es-doc.org/cmip6-dataset-errata/

This is an extract on how it will be possible to search for errata:

"Search & View
Groups can search & view published errata via a simple to use web application which support two types of search:
Faceted: one can search by various facets such as institute, model, experiment, variable, severity & status.
PID: one can enter persistent identifers, i.e. PIDs. Each PID is resolved to the list of affected datasets and by extension the related errata."

So we should be able to use the PID from the files to check if any errata info exists and retrieve it.

Enhancement: add option to output tracking_ids when output== file

I've been asked to add this option so that a user can create a list of all the tracking_ids they have used, to build a citation record. I'm not 100% sure that is required, as for CMIP5 I would have thought using the citation provided in the file should be enough.
In any case I was thinking of adding a flag --citation which would print out the citation id and, if the output is "file", also the tracking ids.
Another option, probably better(?), is to have a --verbose flag which would output all the extra information in a csv file. The further_info_url attribute, which holds the ES-DOC URL, can be reconstructed from the dataset_id.
Currently the find_local function returns only the path; if it could return the dataset_id we could add this easily.
If we want all the information, the following global attributes need to be retrieved too; these are not available in the db tables.

variant_info (important)
source (important)
parent_experiment_id
further_info_url
contact (important)
title (potentially redundant)
description (potentially redundant)
license (could be retrieved from es-doc info)
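
A sketch of retrieving those extra global attributes straight from a file, assuming netCDF4 is available; the attribute tuple below mirrors the list above and would be adjusted per output mode.

from netCDF4 import Dataset

def get_global_attrs(path, attrs=('tracking_id', 'variant_info', 'source',
                                  'further_info_url', 'contact', 'license')):
    # Return whichever of the requested global attributes the file carries
    with Dataset(path) as nc:
        return {attr: getattr(nc, attr, None) for attr in attrs}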

`esgf missing` doesn't see matches with older versions

$ esgf missing --user saw562 --experiment historical --time_frequency mon --variable pr --variable mrro --variable mrlsl --ensemble r1i1p1 --project CMIP5  --all-versions --model NorESM1-M
cmip5.output1.NCC.NorESM1-M.historical.mon.land.Lmon.r1i1p1.v20110901.mrlsl_Lmon_NorESM1-M_historical_r1i1p1_195001-200512.nc|noresg.norstore.no
cmip5.output1.NCC.NorESM1-M.historical.mon.ocean.Omon.r1i1p1.v20110901.pr_Omon_NorESM1-M_historical_r1i1p1_185001-200512.nc|noresg.norstore.no
cmip5.output1.NCC.NorESM1-M.historical.mon.seaIce.OImon.r1i1p1.v20110901.pr_OImon_NorESM1-M_historical_r1i1p1_185001-200512.nc|noresg.norstore.no

$ esgf local --user saw562 --experiment historical --time_frequency mon --variable pr --variable mrro --variable mrlsl --ensemble r1i1p1 --project CMIP5  --all-versions --model NorESM1-M
/g/data1/ua6/unofficial-ESG-replica/tmp/tree/noresg.norstore.no/thredds/fileServer/esg_dataroot/cmor/CMIP5/output1/NCC/NorESM1-M/historical/mon/land/Lmon/r1i1p1/v20120920/mrlsl/mrlsl_Lmon_NorESM1-M_historical_r1i1p1_185001-194912.nc
/g/data1/ua6/unofficial-ESG-replica/tmp/tree/noresg.norstore.no/thredds/fileServer/esg_dataroot/cmor/CMIP5/output1/NCC/NorESM1-M/historical/mon/land/Lmon/r1i1p1/v20120920/mrro/mrro_Lmon_NorESM1-M_historical_r1i1p1_185001-200512.nc
/g/data1/ua6/unofficial-ESG-replica/tmp/tree/norstore-trd-bio1.hpc.ntnu.no/thredds/fileServer/esg_dataroot/cmor/CMIP5/output/NCC/NorESM1-M/historical/mon/atmos/pr/r1i1p1/pr_Amon_NorESM1-M_historical_r1i1p1_185001-200512.nc

The expected behaviour is that by default esgf missing returns no matches, since we have matching files with different versions.
