Comments (11)
For 1, if there is a "require-all" feature, the list of variables (and units) and regions would have to be an exact match.
The way the nomenclature API is structured, you can easily use the codelists to downselect an IamDataFrame. If you look closely at the example above, you'll see that only the filtered IamDataFrame is used for the validation.
magicc.validate(df.filter(variable=magicc.variable), require_all=True)
from nomenclature.
Suggestion: we create a new class WorkflowProcessor, which has its own folder "workflows" (similar to "mappings" for the RegionProcessor) with a specific yaml structure.
Attributes:
- name (instead of "required_for")
- required_timeseries
The WorkflowProcessor can be called via nomenclature.process(df, dsd, processor=a_workflow_processor).
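A minimal sketch of what a file in the proposed "workflows" folder could look like, assuming the two attributes listed above (the file name and the timeseries entries are hypothetical, not an existing nomenclature format):

```yaml
# workflows/magicc.yaml (hypothetical)
name: MAGICC
required_timeseries:
  - variable: Emissions|CO2
    region: World
    years: [2020, 2025, 2050, 2075, 2100]
```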
Do I understand option 1 correctly that this would mean that each upload would have to contain every variable?
If that is the case, I would say options 2 and 3 are the better ones.
We could also do it a different way, by creating something like a minimum specification for input data. This could cover every dimension: model, scenario, variable, year(s), etc.
Something like this:
variable:
  - Final Energy
  - Final Energy|Electricity
year: [2020, 2025, ...]
region: World
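A rough sketch of how data could be checked against such a minimum specification, in plain Python (the spec layout and the function name are hypothetical, not part of the nomenclature API). Note this checks each dimension independently:

```python
# Hypothetical minimum specification, mirroring the yaml above
spec = {
    "variable": ["Final Energy", "Final Energy|Electricity"],
    "year": [2020, 2025],
    "region": ["World"],
}

def missing_entries(rows, spec):
    """Return, per dimension, the required entries absent from the data."""
    return {
        dim: sorted(set(required) - {row[dim] for row in rows})
        for dim, required in spec.items()
    }

rows = [
    {"variable": "Final Energy", "region": "World", "year": 2020},
    {"variable": "Final Energy", "region": "World", "year": 2025},
]

print(missing_entries(rows, spec))
# {'variable': ['Final Energy|Electricity'], 'year': [], 'region': []}
```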
Now that I think about it, such a list of required variables would be super useful whenever there are post-processing steps involved in a model comparison study. This way you can ensure that the post-processing colleagues have everything they need to work with. That would save a lot of time and frustration.
For clarification, I meant that we would need all three options, not as an either-or.
And yes, use case 1 is for post-processing, basically re-using the DataStructureDefinition as the "minimum specification".
Say, for the openENTRANCE project using MAGICC climate post-processing, there would be two DataStructureDefinitions. The code could look something like
import pyam
import nomenclature

df = pyam.IamDataFrame("<file>")

# validate against the full set of allowed openENTRANCE variables
oe = nomenclature.DataStructureDefinition("openentrance")
oe.validate(df)

# then check that all MAGICC-required variables are present
magicc = nomenclature.DataStructureDefinition("magicc")
magicc.validate(df.filter(variable=magicc.variable), require_all=True)
where the MAGICC variables are a subset of the openentrance variables...
Ah right. Just so that I understand this correctly: there would then be two folders in the openENTRANCE project, one for the list of all allowed variables and one for the list of variables required for, in this case, MAGICC?
In that case, I think it might make sense to start talking about centralizing these post-processing-specific variable requirement lists. They should be largely static, right?
Yes, but that is not a discussion to be had in this repository or this issue... First, we need to be able to "require" variables in the nomenclature package.
For case 1), is the test that at least the required set of variables/regions is provided, or that the exact set must be provided, i.e. explode if additional information is present?
Ideally, downstream models are indifferent to additional data, but some may not be. Is that something that you would want to support?
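The two semantics can be written as a simple set comparison (a sketch with hypothetical variable names):

```python
required = {"Emissions|CO2", "Emissions|CH4"}
provided = {"Emissions|CO2", "Emissions|CH4", "Final Energy"}

# "at least": every required entry must be provided; extra data is tolerated
passes_at_least = required <= provided  # subset check

# "exact": the provided entries must match the required set exactly
passes_exact = provided == required

print(passes_at_least, passes_exact)  # True False
```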
Coming back to this issue as the MAGICC use case is high on the list right now.
I really like the idea of re-using the codelists, either by having a "required" attribute in the original codelist or by having a separate one.
The one potential limitation I see is that we would essentially combine "everything with everything". An example would be requiring a timeseries variable to be present for a number of years. If we created a list of allowed years and enforced it, we would probably run into trouble, as different models feature different time resolutions. This could again be solved by using a "required" attribute for the required years, but for some variables we might need data until 2050 while for others until 2100.
Maybe the cleaner approach would be to use a different structure with more fine-grained control. Something like this:
required_for: MAGICC
required_timeseries:
  - variable: Emissions|CO2
    region: World
    years: [2020, 2025, 2050, 2075, 2100]
    required: True
  - variable: Emissions|CH4
    region: World
    years: [2020, 2025, 2050, 2075, 2100]
    optional: True
  - variable: [Final Energy, Final Energy|Electricity, Final Energy...]
    region: [R5 Asia, R5 Middle East & Africa, ...]
    scenario: [Current policies, NDCs, ...]
    years: [2020, 2025, 2030, 2035, 2040, 2045, 2050, 2060, ...]
    required: True
The first attribute, "required_for", is mainly for user communication. Considering a situation where we have multiple different post-processing workflows and/or just general project requirements, it is probably important to know what is required for what. This should also be helpful for error messages.
"required_timeseries" contains a list of timeseries definitions. In the example above, Emissions|CO2 is a required timeseries and Emissions|CH4 an optional one (if I recall correctly, that is actually how the AR6 climate assessment pipeline runs, requiring some variables to be model-native and allowing infilling for others).
Within each "timeseries" definition we still require "everything with everything", as the third timeseries requirement demonstrates. A dataframe would only pass if it contains all the variables (Final Energy, Final Energy|Electricity, ...); for each variable, every scenario; for each variable-scenario combination, every region; and finally, for each variable-scenario-region combination, every year.
This way we should get the ease of defining a large number of timeseries in a single chunk while maintaining fine-grained control.
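The "everything with everything" rule within one timeseries definition amounts to checking the full cartesian product of the listed dimensions. A minimal sketch in plain Python (the data layout and function name are hypothetical, not the nomenclature API):

```python
from itertools import product

# One hypothetical timeseries requirement, shaped like the yaml above
requirement = {
    "variable": ["Final Energy", "Final Energy|Electricity"],
    "scenario": ["Current policies", "NDCs"],
    "region": ["R5 Asia"],
    "year": [2020, 2025],
}

def missing_combinations(data, requirement):
    """List every combination of the requirement's dimension values
    that is absent from the data."""
    dims = list(requirement)
    required = set(product(*(requirement[d] for d in dims)))
    provided = {tuple(row[d] for d in dims) for row in data}
    return sorted(required - provided)

# Data containing all 2 * 2 * 1 * 2 = 8 required combinations
data = [
    {"variable": v, "scenario": s, "region": "R5 Asia", "year": y}
    for v in ["Final Energy", "Final Energy|Electricity"]
    for s in ["Current policies", "NDCs"]
    for y in [2020, 2025]
]

print(missing_combinations(data, requirement))  # [] -> all present
```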
@lewisjared and @danielhuppmann would love to hear your thoughts on the matter. Do you think this would be a good way to go?
Sounds good to me. I'll get on that.
@danielhuppmann I've started implementing this feature and have a few design questions. Should we discuss them here, or is it better if I open a draft PR and we tackle them over the actual code?