Giter VIP home page Giter VIP logo

Comments (9)

danielhuppmann avatar danielhuppmann commented on June 6, 2024 1

Sounds like a good approach to disentangle the two main classes and flip the direction of dependency.

Two suggestions:

  • make the new function for consistency between definitions and model-mappings a method of the RegionProcessor class and call it validate()...
  • call the validation method always as part of the apply(IamDataFrame, DataStructureDefinition), i.e.,
    def apply(self, IamDataFrame, DataStructureDefinition):
        self.validate(DataStructureDefinition)
        ...
    Normally, this wouldn't be a good idea, but the validation isn't very expensive and apply() is usually only performed once per workflow anyway...

from nomenclature.

phackstock avatar phackstock commented on June 6, 2024 1

Ah right, a processor subfolder and an error subfolder, yes let's do that.

from nomenclature.

danielhuppmann avatar danielhuppmann commented on June 6, 2024

For clarification: the quick-fix workaround validate variables, apply-region-processing, validate regions cannot lead to errors in the database, but there is a risk that mistakes are made when setting up or maintaining/updating a project workflow.

I would advise against putting too much logic into the simple functions - RegionProcessor.apply() "applies a mapping if it exists and otherwise returns the data unchanged" is better than "... returns the data unchanged but also validates".

So I suggest a third option - put the complicated logic into a new function of the DataStructureDefinition:

def process(df: IamDataFrame, dimensions: list = None, processor: RegionProcessor = None):
    if processor is None:
        self.validate(df)
    else:
        if not isinstance(processor, RegionProcessor):
            raise ValueError(f"Unknown processor type: {processor}")
        dimensions = dimensions or self.dimensions
        dimensions.remove("region")
        self.validate(df, dimensions= dimensions.pop) 
        df = processor.apply(df)
        self.validate(df, dimensions=["region"]) 
    return df        

Then, a workflow in a project repository can be either a simple

df = dsd.process(df)

or with region-processing

df = dsd.process(df, processor=rp)

Two advantages:

  1. This is easily forward-compatible if we introduce new types of processors (e.g., a scenario-rename-processor) or other features.
  2. This can be easily turned into a CLI.

from nomenclature.

phackstock avatar phackstock commented on June 6, 2024

Good point about putting too much logic into RegionProcessor.apply().
I like the idea of putting everything into a single function.
What I don't like about the layout that this would create is the strange relationship that RegionProcessor and DataStructureDefinition.

Right now we instantiate RegionProcessor with the directory of the mappings plus a DataStructureDefinition instance. So RegionProcessor actually contains a DataStructureDefinition instance. This is for two reasons:

  1. In the validation is is checked if all the regions from the model mapping comply with the DataStructureDefinition.
  2. In RegionProcessor.apply() it is used to determine how to aggregate variables based on the pyam kwargs provided in the variable template.

So now we would have a RegionProcessor which contains a DataStructureDefinition which is then used as an input for DataStructureDefinition.process. This makes for a rather convoluted structure.

My proposed solution would be a bit of a redesign/simplification of RegionProcessor to disentangle the two classes:

  • Remove the DataStructureDefinition from RegionProcessor altogether.
  • Instead of doing the validation of the model mappings inside RegionProcessor we create the method DataStructureDefinition.validate_model_mappings() or something along those lines. It is an analogous function to DataStructureDefinition.validate() but instead of an IamDataFrame it takes a RegionProcessor. This function might be called as a first step in DataStructureDefinition.process().
  • Taking care of the second reason we currently need a DataStructureDefinition inside RegionProcessor I would change RegionProcessor.apply() to not only require a data frame but also a DataStructureDefinition as input. This way we get the information about the variable pyam kwargs.

from nomenclature.

phackstock avatar phackstock commented on June 6, 2024

Ah very good idea, that should be the least amount of effort.

from nomenclature.

phackstock avatar phackstock commented on June 6, 2024

As I'm looking into implementing this, I'm thinking about where to put the process function. Your proposal @danielhuppmann was to implement it as part of the DatastructureDefinition class as DatastructureDefinition.process, however I'm not convinced this is the right place.
Here's my reasoning, the implementation would look something like this:

def process(self, df: IamDataFrame, dimensions: list = None, processor: RegionProcessor = None):
    if processor is None:
        self.validate(df)
    else:
        if not isinstance(processor, RegionProcessor):
            raise ValueError(f"Unknown processor type: {processor}")
        dimensions = dimensions or self.dimensions
        if "region" in dimensions:
            dimensions.remove("region")
        self.validate(df, dimensions=dimensions) 
        df = processor.apply(df, self) # This is the line I have an issue with
        self.validate(df, dimensions=["region"]) 
    return df

From my view process should be its own standalone function outside of any class. Reason being that I find it bad design to have a class method that feeds a reference to its own instance into another object, i.e. what we would be doing with processor.apply(df, self).
In my view the process function is a sort of convenience function that is there to avoid code duplication and as a result of that, mistakes. I would expose to put it in the utils module and expose it at top level so that we can use from nomenclature import process. This function would have the following signature: process(df: IamDataFrame, dsd:DataStructureDefinition, dimensions: list = None, processor: RegionProcessor = None) and inside would replace self with dsd.
What are your thoughts @danielhuppmann?

from nomenclature.

danielhuppmann avatar danielhuppmann commented on June 6, 2024

Makes a lot of sense!

What I would do as a prepration step, though, is a renaming of the files to make the structure more logical (and avoid using the word "models" in file names).

  • core -> definition
  • codes -> codelist
  • codes_models -> code
  • region_mapping_models -> processor.region
  • region_mapping_errors -> error.region

Then, the new process() function can be the first item in a new core.py.

from nomenclature.

phackstock avatar phackstock commented on June 6, 2024

Ah yeah good idea.

Just one quick clarifying question with processor.region and error.region you mean making it into its own region sub-package with processor.py and error.py inside?

from nomenclature.

danielhuppmann avatar danielhuppmann commented on June 6, 2024

Well, I guess that we will have several "processor"-types, so what I meant was to have a processor-subfolder with the different types of processors inside. But either way is fine...

from nomenclature.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.