Comments (11)
For 1, if there is a "require-all" feature, the list of variables (and units) and regions would have to be an exact match.
The way the nomenclature API is structured, you can easily use the codelists to downselect an IamDataFrame. If you look closely at the example above, you'll see that only the filtered IamDataFrame is used for the validation.
magicc.validate(df.filter(variable=magicc.variable), require_all=True)
from nomenclature.
Suggestion: we create a new class WorkflowProcessor, which has its own folder "workflows" (similar to "mappings" for the RegionProcessor) with a specific yaml structure.
Attributes:
- name (instead of "required_for")
- required_timeseries
The WorkflowProcessor can be called via nomenclature.process(df, dsd, processor=a_workflow_processor).
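A minimal sketch of what a file in the proposed "workflows" folder could look like, assuming the two attributes listed above (the file name and the timeseries entries are hypothetical, not an existing nomenclature format):

```yaml
# workflows/magicc.yaml (hypothetical)
name: MAGICC
required_timeseries:
  - variable: Emissions|CO2
    region: World
    years: [2020, 2025, 2050, 2075, 2100]
```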
Do I understand option 1 correctly that this would mean that each upload would have to contain every variable?
If that is the case, I would say options 2 and 3 are the better ones.
We could also do it a different way, by creating something like a minimum specification for input data. This could cover every dimension: model, scenario, variable, year(s), etc.
Something like this:
variable:
  - Final Energy
  - Final Energy|Electricity
year: [2020, 2025, ...]
region: World
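A rough sketch of how data could be checked against such a minimum specification, in plain Python (the spec layout and the function name are hypothetical, not part of the nomenclature API). Note this checks each dimension independently:

```python
# Hypothetical minimum specification, mirroring the yaml above
spec = {
    "variable": ["Final Energy", "Final Energy|Electricity"],
    "year": [2020, 2025],
    "region": ["World"],
}

def missing_entries(rows, spec):
    """Return, per dimension, the required entries absent from the data."""
    return {
        dim: sorted(set(required) - {row[dim] for row in rows})
        for dim, required in spec.items()
    }

rows = [
    {"variable": "Final Energy", "region": "World", "year": 2020},
    {"variable": "Final Energy", "region": "World", "year": 2025},
]

print(missing_entries(rows, spec))
# {'variable': ['Final Energy|Electricity'], 'year': [], 'region': []}
```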
Now that I think about it, such a list of required variables would be super useful whenever there are post-processing steps involved in a model comparison study. This way you can ensure that the post-processing colleagues have everything they need to work with. That would save a lot of time and frustration.
For clarification, I meant that we would need all three options, not as an either-or.
And yes, use case 1 is for post-processing, basically re-using the DataStructureDefinition as the "minimum specification".
Say, for the openENTRANCE project using MAGICC climate post-processing, there would be two DataStructureDefinitions. The code could look something like
import pyam
import nomenclature

df = pyam.IamDataFrame("<file>")

# validate against the full set of allowed openENTRANCE variables
oe = nomenclature.DataStructureDefinition("openentrance")
oe.validate(df)

# then check that all MAGICC-required variables are present
magicc = nomenclature.DataStructureDefinition("magicc")
magicc.validate(df.filter(variable=magicc.variable), require_all=True)
where the MAGICC variables are a subset of the openentrance variables...
Ah right. Just so that I understand this correctly: there would then be two folders in the openENTRANCE project, one for the list of all allowed variables and one for the list of variables required for, in this case, MAGICC?
In that case, I think it might make sense to start talking about centralizing these post-processing-specific variable requirement lists. They should be largely static, right?
Yes, but that is not a discussion to be had in this repository or this issue... First, we need to be able to "require" variables in the nomenclature package.
For case 1), is the test that at least the required set of variables/regions is provided, or that the exact set must be provided, i.e. explode if additional information is present?
Ideally, downstream models are indifferent to additional data, but some may not be. Is that something that you would want to support?
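The two semantics can be written as a simple set comparison (a sketch with hypothetical variable names):

```python
required = {"Emissions|CO2", "Emissions|CH4"}
provided = {"Emissions|CO2", "Emissions|CH4", "Final Energy"}

# "at least": every required entry must be provided; extra data is tolerated
passes_at_least = required <= provided  # subset check

# "exact": the provided entries must match the required set exactly
passes_exact = provided == required

print(passes_at_least, passes_exact)  # True False
```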
Coming back to this issue as the MAGICC use case is high on the list right now.
I really like the idea of re-using the codelists, either by having a "required" attribute in the original codelist or by having a separate one.
The one potential limitation I see is that we would essentially combine "everything with everything". An example would be requiring a timeseries variable to be present for a number of years. If we created a list of allowed years and enforced it, we would probably run into trouble, as different models feature different time resolutions. This could again be solved by using a "required" attribute for the required years, but for some variables we might need data until 2050 while for others until 2100.
Maybe the cleaner approach would be to use a different structure with more fine-grained control. Something like this:
required_for: MAGICC
required_timeseries:
  - variable: Emissions|CO2
    region: World
    years: [2020, 2025, 2050, 2075, 2100]
    required: True
  - variable: Emissions|CH4
    region: World
    years: [2020, 2025, 2050, 2075, 2100]
    optional: True
  - variable: [Final Energy, Final Energy|Electricity, Final Energy...]
    region: [R5 Asia, R5 Middle East & Africa, ...]
    scenario: [Current policies, NDCs, ...]
    years: [2020, 2025, 2030, 2035, 2040, 2045, 2050, 2060, ...]
    required: True
The first attribute, "required_for", is mainly for user communication. Considering a situation where we have multiple different post-processing workflows and/or just general project requirements, it is probably important to know what is required for what. This should also be helpful for error messages.
"required_timeseries" contains a list of timeseries definitions. In the example above, Emissions|CO2 is a required timeseries and Emissions|CH4 an optional one (if I recall correctly, that is actually how the AR6 climate assessment pipeline runs, requiring some variables to be model-native and allowing infilling for others).
Within each "timeseries" definition we still require "everything with everything", as the third timeseries requirement demonstrates. A dataframe would only pass if it contains all the variables (Final Energy, Final Energy|Electricity, ...); for each variable, every scenario; for each variable-scenario combination, every region; and finally, for each variable-scenario-region combination, every year.
This way we should get the ease of defining a large number of timeseries in a single chunk while maintaining fine-grained control.
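The "everything with everything" rule within one timeseries definition amounts to checking the full cartesian product of the listed dimensions. A minimal sketch in plain Python (the data layout and function name are hypothetical, not the nomenclature API):

```python
from itertools import product

# One hypothetical timeseries requirement, shaped like the yaml above
requirement = {
    "variable": ["Final Energy", "Final Energy|Electricity"],
    "scenario": ["Current policies", "NDCs"],
    "region": ["R5 Asia"],
    "year": [2020, 2025],
}

def missing_combinations(data, requirement):
    """List every combination of the requirement's dimension values
    that is absent from the data."""
    dims = list(requirement)
    required = set(product(*(requirement[d] for d in dims)))
    provided = {tuple(row[d] for d in dims) for row in data}
    return sorted(required - provided)

# Data containing all 2 * 2 * 1 * 2 = 8 required combinations
data = [
    {"variable": v, "scenario": s, "region": "R5 Asia", "year": y}
    for v in ["Final Energy", "Final Energy|Electricity"]
    for s in ["Current policies", "NDCs"]
    for y in [2020, 2025]
]

print(missing_combinations(data, requirement))  # [] -> all present
```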
@lewisjared and @danielhuppmann would love to hear your thoughts on the matter. Do you think this would be a good way to go?
Sounds good to me. I'll get on that.
@danielhuppmann I've started implementing this feature and have a few design questions. Should we discuss them here, or is it better if I open a draft PR and we tackle them over the actual code?