This issue is in response to <a class="issue-link js-issue-link" data-error-text="Fail

Ah right, a processor subfolder and an <code class="n

Order of validation & aggregation about nomenclature HOT 9 CLOSED

iamconsortium commented on June 6, 2024

Order of validation & aggregation

from nomenclature.

Comments (9)

danielhuppmann commented on June 6, 2024 1

Sounds like a good approach to disentangle the two main classes and flip the direction of dependency.

Two suggestions:

make the new function for consistency between definitions and model-mappings a method of the RegionProcessor class and call it validate()...
call the validation method always as part of the apply(IamDataFrame, DataStructureDefinition), i.e.,
```
def apply(self, IamDataFrame, DataStructureDefinition):
    self.validate(DataStructureDefinition)
    ...
```
Normally, this wouldn't be a good idea, but the validation isn't very expensive and apply() is usually only performed once per workflow anyway...

from nomenclature.

phackstock commented on June 6, 2024 1

Ah right, a processor subfolder and an error subfolder, yes let's do that.

from nomenclature.

danielhuppmann commented on June 6, 2024

For clarification: the quick-fix workaround validate variables, apply-region-processing, validate regions cannot lead to errors in the database, but there is a risk that mistakes are made when setting up or maintaining/updating a project workflow.

I would advise against putting too much logic into the simple functions - RegionProcessor.apply() "applies a mapping if it exists and otherwise returns the data unchanged" is better than "... returns the data unchanged but also validates".

So I suggest a third option - put the complicated logic into a new function of the DataStructureDefinition:

def process(df: IamDataFrame, dimensions: list = None, processor: RegionProcessor = None):
    if processor is None:
        self.validate(df)
    else:
        if not isinstance(processor, RegionProcessor):
            raise ValueError(f"Unknown processor type: {processor}")
        dimensions = dimensions or self.dimensions
        dimensions.remove("region")
        self.validate(df, dimensions= dimensions.pop) 
        df = processor.apply(df)
        self.validate(df, dimensions=["region"]) 
    return df

Then, a workflow in a project repository can be either a simple

df = dsd.process(df)

or with region-processing

df = dsd.process(df, processor=rp)

Two advantages:

This is easily forward-compatible if we introduce new types of processors (e.g., a scenario-rename-processor) or other features.
This can be easily turned into a CLI.

from nomenclature.

phackstock commented on June 6, 2024

Good point about putting too much logic into RegionProcessor.apply().
I like the idea of putting everything into a single function.
What I don't like about the layout that this would create is the strange relationship that RegionProcessor and DataStructureDefinition.

Right now we instantiate RegionProcessor with the directory of the mappings plus a DataStructureDefinition instance. So RegionProcessor actually contains a DataStructureDefinition instance. This is for two reasons:

In the validation is is checked if all the regions from the model mapping comply with the DataStructureDefinition.
In RegionProcessor.apply() it is used to determine how to aggregate variables based on the pyam kwargs provided in the variable template.

So now we would have a RegionProcessor which contains a DataStructureDefinition which is then used as an input for DataStructureDefinition.process. This makes for a rather convoluted structure.

My proposed solution would be a bit of a redesign/simplification of RegionProcessor to disentangle the two classes:

Remove the DataStructureDefinition from RegionProcessor altogether.
Instead of doing the validation of the model mappings inside RegionProcessor we create the method DataStructureDefinition.validate_model_mappings() or something along those lines. It is an analogous function to DataStructureDefinition.validate() but instead of an IamDataFrame it takes a RegionProcessor. This function might be called as a first step in DataStructureDefinition.process().
Taking care of the second reason we currently need a DataStructureDefinition inside RegionProcessor I would change RegionProcessor.apply() to not only require a data frame but also a DataStructureDefinition as input. This way we get the information about the variable pyam kwargs.

from nomenclature.

phackstock commented on June 6, 2024

Ah very good idea, that should be the least amount of effort.

from nomenclature.

phackstock commented on June 6, 2024

As I'm looking into implementing this, I'm thinking about where to put the process function. Your proposal @danielhuppmann was to implement it as part of the DatastructureDefinition class as DatastructureDefinition.process, however I'm not convinced this is the right place.
Here's my reasoning, the implementation would look something like this:

def process(self, df: IamDataFrame, dimensions: list = None, processor: RegionProcessor = None):
    if processor is None:
        self.validate(df)
    else:
        if not isinstance(processor, RegionProcessor):
            raise ValueError(f"Unknown processor type: {processor}")
        dimensions = dimensions or self.dimensions
        if "region" in dimensions:
            dimensions.remove("region")
        self.validate(df, dimensions=dimensions) 
        df = processor.apply(df, self) # This is the line I have an issue with
        self.validate(df, dimensions=["region"]) 
    return df

From my view process should be its own standalone function outside of any class. Reason being that I find it bad design to have a class method that feeds a reference to its own instance into another object, i.e. what we would be doing with processor.apply(df, self).
In my view the process function is a sort of convenience function that is there to avoid code duplication and as a result of that, mistakes. I would expose to put it in the utils module and expose it at top level so that we can use from nomenclature import process. This function would have the following signature: process(df: IamDataFrame, dsd:DataStructureDefinition, dimensions: list = None, processor: RegionProcessor = None) and inside would replace self with dsd.
What are your thoughts @danielhuppmann?

from nomenclature.

danielhuppmann commented on June 6, 2024

Makes a lot of sense!

What I would do as a prepration step, though, is a renaming of the files to make the structure more logical (and avoid using the word "models" in file names).

core -> definition
codes -> codelist
codes_models -> code
region_mapping_models -> processor.region
region_mapping_errors -> error.region

Then, the new process() function can be the first item in a new core.py.

from nomenclature.

phackstock commented on June 6, 2024

Ah yeah good idea.

Just one quick clarifying question with processor.region and error.region you mean making it into its own region sub-package with processor.py and error.py inside?

from nomenclature.

danielhuppmann commented on June 6, 2024

Well, I guess that we will have several "processor"-types, so what I meant was to have a processor-subfolder with the different types of processors inside. But either way is fine...

from nomenclature.

Order of validation & aggregation about nomenclature HOT 9 CLOSED

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent