Comments (9)
Sounds like a good approach to disentangle the two main classes and flip the direction of dependency.
Two suggestions:
- make the new function for consistency between definitions and model mappings a method of the `RegionProcessor` class and call it `validate()`
- always call the validation method as part of `apply(IamDataFrame, DataStructureDefinition)`, i.e.,

```python
def apply(self, df: IamDataFrame, dsd: DataStructureDefinition):
    self.validate(dsd)
    ...
```

Normally, this wouldn't be a good idea, but the validation isn't very expensive and `apply()` is usually only performed once per workflow anyway...
---
Ah right, a `processor` subfolder and an `error` subfolder, yes let's do that.
---
For clarification: the quick-fix workaround (validate variables, apply region-processing, validate regions) cannot lead to errors in the database, but there is a risk that mistakes are made when setting up or maintaining/updating a project workflow.
I would advise against putting too much logic into the simple functions. `RegionProcessor.apply()` "applies a mapping if it exists and otherwise returns the data unchanged" is better than "... returns the data unchanged but also validates".
So I suggest a third option: put the complicated logic into a new method of the `DataStructureDefinition`:
```python
def process(self, df: IamDataFrame, dimensions: list = None, processor: RegionProcessor = None):
    if processor is None:
        self.validate(df)
    else:
        if not isinstance(processor, RegionProcessor):
            raise ValueError(f"Unknown processor type: {processor}")
        dimensions = dimensions or self.dimensions
        dimensions.remove("region")
        self.validate(df, dimensions=dimensions)
        df = processor.apply(df)
        self.validate(df, dimensions=["region"])
    return df
```
Then, a workflow in a project repository can be either a simple `df = dsd.process(df)` or, with region-processing, `df = dsd.process(df, processor=rp)`.
Two advantages:
- This is easily forward-compatible if we introduce new types of processors (e.g., a scenario-rename-processor) or other features.
- This can be easily turned into a CLI.
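The CLI point could be sketched, for instance, with stdlib `argparse`; the flag names and defaults below are purely illustrative, not an existing nomenclature interface:

```python
# Hypothetical CLI wrapper for dsd.process(); flag names and defaults
# are illustrative stand-ins, not an existing nomenclature CLI.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Validate a scenario file against a DataStructureDefinition"
    )
    parser.add_argument("data_file", help="input data file (IAMC format)")
    parser.add_argument(
        "--definitions", default="definitions", help="folder with the codelists"
    )
    parser.add_argument(
        "--mappings", default=None, help="optional folder with model mappings"
    )
    return parser

# Example invocation (parsed from a list instead of sys.argv for illustration)
args = build_parser().parse_args(["scenarios.xlsx", "--mappings", "mappings"])
```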
---
Good point about putting too much logic into `RegionProcessor.apply()`.
I like the idea of putting everything into a single function.
What I don't like about the layout that this would create is the strange relationship between `RegionProcessor` and `DataStructureDefinition`.
Right now we instantiate `RegionProcessor` with the directory of the mappings plus a `DataStructureDefinition` instance. So `RegionProcessor` actually contains a `DataStructureDefinition` instance. This is for two reasons:
- In the validation, it is checked whether all the regions from the model mapping comply with the `DataStructureDefinition`.
- In `RegionProcessor.apply()`, it is used to determine how to aggregate variables based on the pyam kwargs provided in the variable template.
So now we would have a `RegionProcessor` which contains a `DataStructureDefinition`, which is then used as an input for `DataStructureDefinition.process`. This makes for a rather convoluted structure.
My proposed solution would be a bit of a redesign/simplification of `RegionProcessor` to disentangle the two classes:
- Remove the `DataStructureDefinition` from `RegionProcessor` altogether.
- Instead of doing the validation of the model mappings inside `RegionProcessor`, we create a method `DataStructureDefinition.validate_model_mappings()` or something along those lines. It is analogous to `DataStructureDefinition.validate()`, but instead of an `IamDataFrame` it takes a `RegionProcessor`. This function might be called as a first step in `DataStructureDefinition.process()`.
- Taking care of the second reason we currently need a `DataStructureDefinition` inside `RegionProcessor`, I would change `RegionProcessor.apply()` to require not only a data frame but also a `DataStructureDefinition` as input. This way we get the information about the variable pyam kwargs.
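The proposed disentanglement can be sketched with minimal stand-in classes (the real nomenclature classes have far more functionality; the attribute and method bodies here are illustrative assumptions):

```python
# Sketch of the proposed disentanglement: RegionProcessor no longer stores
# a DataStructureDefinition; apply() receives one explicitly instead.
# All class internals here are dummy stand-ins.

class DataStructureDefinition:
    def __init__(self, regions):
        self.region = set(regions)  # the defined region codelist

    def validate_model_mappings(self, processor):
        # analogous to validate(), but takes a RegionProcessor
        # instead of an IamDataFrame
        unknown = set(processor.mapped_regions) - self.region
        if unknown:
            raise ValueError(f"Model mapping uses undefined regions: {unknown}")


class RegionProcessor:
    def __init__(self, mapped_regions):
        # only mapping information; no DataStructureDefinition attribute
        self.mapped_regions = list(mapped_regions)

    def apply(self, df, dsd):
        # dsd is passed in explicitly, e.g. to look up the pyam kwargs
        # for variable aggregation; here the data is returned unchanged
        return df


dsd = DataStructureDefinition(["World", "Europe"])
rp = RegionProcessor(["Europe"])
dsd.validate_model_mappings(rp)  # passes: "Europe" is a defined region
```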
---
Ah very good idea, that should be the least amount of effort.
---
As I'm looking into implementing this, I'm thinking about where to put the `process` function. Your proposal @danielhuppmann was to implement it as part of the `DataStructureDefinition` class as `DataStructureDefinition.process`; however, I'm not convinced this is the right place.
Here's my reasoning. The implementation would look something like this:
```python
def process(self, df: IamDataFrame, dimensions: list = None, processor: RegionProcessor = None):
    if processor is None:
        self.validate(df)
    else:
        if not isinstance(processor, RegionProcessor):
            raise ValueError(f"Unknown processor type: {processor}")
        dimensions = dimensions or self.dimensions
        if "region" in dimensions:
            dimensions.remove("region")
        self.validate(df, dimensions=dimensions)
        df = processor.apply(df, self)  # This is the line I have an issue with
        self.validate(df, dimensions=["region"])
    return df
```
From my view, `process` should be its own standalone function outside of any class. The reason is that I find it bad design to have a class method that feeds a reference to its own instance into another object, i.e. what we would be doing with `processor.apply(df, self)`.
In my view, the `process` function is a sort of convenience function that is there to avoid code duplication and, as a result, mistakes. I would propose to put it in the utils module and expose it at the top level so that we can use `from nomenclature import process`. This function would have the signature `process(df: IamDataFrame, dsd: DataStructureDefinition, dimensions: list = None, processor: RegionProcessor = None)`, and inside, `self` would be replaced with `dsd`.
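A runnable sketch of this standalone variant, using minimal dummy stand-ins for the two classes (the real nomenclature classes do actual validation and aggregation):

```python
# Hypothetical top-level process() function as proposed above;
# DataStructureDefinition and RegionProcessor are minimal stand-ins,
# not the real nomenclature classes.

class DataStructureDefinition:
    dimensions = ["variable", "region"]

    def validate(self, df, dimensions=None):
        pass  # stand-in: would raise on invalid codes


class RegionProcessor:
    def apply(self, df, dsd):
        return df  # stand-in: would perform region aggregation


def process(df, dsd, dimensions=None, processor=None):
    """Validate df against dsd, applying region-processing if given."""
    if processor is None:
        dsd.validate(df)
    else:
        if not isinstance(processor, RegionProcessor):
            raise ValueError(f"Unknown processor type: {processor}")
        dimensions = list(dimensions or dsd.dimensions)
        if "region" in dimensions:
            dimensions.remove("region")
        dsd.validate(df, dimensions=dimensions)
        df = processor.apply(df, dsd)  # no reference to `self` needed
        dsd.validate(df, dimensions=["region"])
    return df


result = process("df", DataStructureDefinition(), processor=RegionProcessor())
```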
What are your thoughts @danielhuppmann?
---
Makes a lot of sense!
What I would do as a preparation step, though, is a renaming of the files to make the structure more logical (and avoid using the word "models" in file names):
- `core` -> `definition`
- `codes` -> `codelist`
- `codes_models` -> `code`
- `region_mapping_models` -> `processor.region`
- `region_mapping_errors` -> `error.region`

Then, the new `process()` function can be the first item in a new `core.py`.
---
Ah yeah, good idea.
Just one quick clarifying question: with `processor.region` and `error.region`, do you mean making it into its own `region` sub-package with `processor.py` and `error.py` inside?
---
Well, I guess that we will have several "processor" types, so what I meant was to have a `processor` subfolder with the different types of processors inside. But either way is fine...
---