
nomenclature's People

Contributors

danielhuppmann, gretchenschowalter, lauwien, luciecastella, phackstock, stickler-ci[bot]



nomenclature's Issues

Check for collision within a set of region mappings

Building on #22, one of the next steps in verification is to check that there are no collisions within all of the provided mappings.
Illustration of one failure scenario:
Given mapping_1:

name: model_1
native_regions:
  - native_region_a: alternative_name

mapping_2:

name: model_2
native_regions:
  - alternative_name

Alternatively, it could also be that in mapping_2 we have a region called native_region_b which we end up renaming to alternative_name.
In any case, a next step in region mapping verification could be to check for this.
@danielhuppmann, @peterkolp is this really an issue or is this actually fine? If it is fine, feel free to close the issue. If it is something that should be checked are there any other failure scenarios that should be checked? For the common regions it is of course expected that they share the same name.
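A check along these lines could be sketched as follows; plain dicts stand in for the actual mapping objects, and `find_native_region_collisions` is a hypothetical name, not the package API:

```python
# Hypothetical sketch: detect when the target name of a renamed native
# region in one mapping collides with a native-region name in another.
def find_native_region_collisions(mappings):
    """mappings: list of dicts with keys 'name' and 'native_regions', where
    each native region is either a plain name or an {original: rename} dict."""
    seen = {}  # target region name -> name of the mapping that claimed it
    collisions = []
    for mapping in mappings:
        for region in mapping["native_regions"]:
            if isinstance(region, dict):
                # renamed region: the target name is the dict value
                ((_, target),) = region.items()
            else:
                target = region
            if target in seen and seen[target] != mapping["name"]:
                collisions.append((target, seen[target], mapping["name"]))
            else:
                seen[target] = mapping["name"]
    return collisions


mapping_1 = {"name": "model_1", "native_regions": [{"native_region_a": "alternative_name"}]}
mapping_2 = {"name": "model_2", "native_regions": ["alternative_name"]}
```

Running this over the two mappings from the failure scenario above would report the collision on alternative_name.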

Single constituent common region can lead to unexpected results

This is the result from a conversation with @pweigmann.
The partial region aggregation feature can lead to unexpected results if a common region is comprised of only a single constituent region, e.g.:

model: model_a
common_regions:
  - Region_A:
    - region_a

This configuration is not rare: it can often be the case that a region which a model natively reports (region_a in the example) falls under the category of a comparison region. In this case no aggregation is necessary; a renaming from region_a to Region_A is sufficient to ensure comparability with other models.

This is where unexpected behavior can occur as common_regions in essence means "perform region-aggregation". At the same time, region aggregation works on a principle we call "partial region-aggregation" (https://nomenclature-iamc.readthedocs.io/en/latest/usage.html#partial-region-aggregation) which means that nomenclature will perform the aggregation of the constituent regions but only use that for comparison. The results of the processing will be the model native results.
However, this only works if the name of the resulting common region, in our case Region_A remains the same, i.e.:

model: model_a
common_regions:
  - Region_A:
    - Region_A

If we have a renaming, only the aggregated results will be used.
This is normally not a problem as they should be identical to the provided model-native values.
However, if a variable uses weighted aggregation, such as Price|Carbon, and the weight variable contains negative values, we lose data points wherever the weight is negative.

There are a few options for how this problem can be fixed or circumvented:

  1. Disallow single constituent common regions, instead force them to be in the native_regions section.
  2. Implement a check in the region aggregation to skip aggregation in case a common region is only comprised of a single region and only rename.

My preference would be option 1: it is cleaner to implement, the region aggregation comparison feature does not really make sense if the region is comprised of only a single region, and there is another related issue with the current implementation:

Assume a model natively reports two regions, "region_a" and "region_b", and that they should be renamed to "Region_A" and "Region_B" respectively. Setting aside the potential issues with missing data, the following model mapping works:

model: model_a
native_regions:
  - region_a: Region_A
common_regions:
  - Region_B:
    - region_b 

The inconsistency here is that for native_regions the value on the right is the one that will appear in the result, whereas for common_regions (here used for renaming only) it is the one on the left/above.
One way to solve this would be to go with option 1 mentioned above and disallow single-constituent common regions. This would also be in line with the Zen of Python: "There should be one-- and preferably only one --obvious way to do it." In this case, renaming model-native regions could only be done by adding them to the native_regions list.
If we go with this option we might also have to reconsider the naming of the two categories native_regions and common_regions, as they would then more closely correspond to do-not-aggregate and aggregate.
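Option 1 could be enforced with a simple validation step. A minimal sketch, assuming common regions are parsed into {name: [constituents]} entries (the function name is illustrative):

```python
def check_single_constituent(common_regions):
    """Raise if any common region has exactly one constituent region,
    since such entries should be expressed as a renaming in 'native_regions'."""
    for entry in common_regions:
        ((name, constituents),) = entry.items()
        if len(constituents) == 1:
            raise ValueError(
                f"Common region '{name}' has a single constituent "
                f"'{constituents[0]}'; use a renaming in 'native_regions' instead."
            )
```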

@danielhuppmann, please let me know what you think.
Also @pweigmann, @orichters and @Renato-Rodrigues (you came to mind first as modellers with current experience using nomenclature; please feel free to loop in other colleagues), your input is most welcome.

Clean up leftovers from renaming to `DataStructureDefinition`

There are some leftover references to Nomenclature, the old name of the DataStructureDefinition class.
From taking a quick look I've found references in validation.py, conftest.py, test_core.py, test_region_aggregation.py, test_testing.py and test_validation.py.

Raise/warn if unexpected regions exist for region-processing

The RegionProcessor currently silently removes all regions that are not native regions (and once PR #99 is merged, common regions).

This can be confusing for users who successfully upload a data file to the Scenario Explorer but then do not find the removed regions in the Explorer.

Options:

  1. fail loudly by raising an error if there are unexpected regions
  2. write a warning to the log of the removed regions

Option 1 is more annoying for users, but Option 2 has the caveat that users usually do not look at the log file, in particular if the color is green and some data arrived in the database.
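A sketch of option 2, with an opt-in switch for option 1; the names are illustrative, not the current RegionProcessor API:

```python
import logging

logger = logging.getLogger(__name__)


def check_unexpected_regions(df_regions, allowed_regions, raise_error=False):
    """Warn about (or raise on) regions that would otherwise be silently dropped."""
    unexpected = set(df_regions) - set(allowed_regions)
    if unexpected:
        msg = f"Unexpected regions will be removed: {sorted(unexpected)}"
        if raise_error:
            raise ValueError(msg)  # option 1: fail loudly
        logger.warning(msg)  # option 2: write a warning to the log
    return unexpected
```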

Preferences? @phackstock @Renato-Rodrigues @JohannesEmm @HauHe @robertpietzcker

Add definitions and mappings folder options to the cli

Following the discussion in #64, it emerged that the CLI currently looks for folders called "mappings" and "definitions" in cli.cli_valid_project.
While these two names are a good default, it should be possible to change them, since we allow the same when instantiating RegionProcessor and DataStructureDefinition.

Check for model mapping collisions

We should include a check that ensures we have only one mapping per model.

From my point of view, the main implementation question is how to handle tasks that relate to an ensemble of mappings. If we only need this one check, a function that creates a dictionary should do, something along these lines:

def read_all_mappings(mapping_dir: Union[str, Path]) -> dict:
    region_mapping_dict = {}
    for mapping_file in Path(mapping_dir).iterdir():
        mapping = RegionAggregationMapping.from_file(mapping_file)
        if mapping.model in region_mapping_dict:
            raise ModelMappingOverlapError()
        region_mapping_dict[mapping.model] = mapping
    return region_mapping_dict

>>> read_all_mappings(Path("mappings/"))
{"model_1": RegionAggregationMapping("model_1"), "model_2": RegionAggregationMapping("model_2")}

However, as we might want more functions associated with a collection of RegionAggregationMapping objects, we might want to package that in another class, possibly also pydantic-based. Like so:

from pathlib import Path
from typing import Dict, Union

from pydantic import BaseModel


class RegionAggregationCollection(BaseModel):
    mappings: Dict[str, RegionAggregationMapping]

    @classmethod
    def read_all_mappings(cls, mapping_dir: Union[str, Path]) -> "RegionAggregationCollection":
        # implementation: see read_all_mappings above
        ...

    def __getitem__(self, key: str):
        return self.mappings[key]

The __getitem__ method emulates dict behavior so that calls like collection["model_1"] are possible on an instance. In this class we could then also add other validation functions or utilities.

My personal preference would be option 2, as it continues the pydantic pattern used so far, and all functions related to a group of region aggregation mappings naturally fit there.

Shorten `file` attribute in codelists

All items in a codelist have an attribute file to identify from which file the code was imported - this might be helpful in debugging in more complex nomenclatures.

The file name should be saved relative to the definitions folder, not relative to the working directory.
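Assuming the attribute holds a plain path string, trimming it could look like this (a sketch, not the actual codelist implementation):

```python
from pathlib import Path


def relative_file_attr(file_path, definitions_dir="definitions"):
    """Return the codelist file path relative to the definitions folder
    instead of relative to the working directory."""
    return str(Path(file_path).relative_to(definitions_dir))
```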

Clean up API docs

There are three issues in the API docs:

  • The intersphinx mapping to nomenclature.RegionProcessor currently does not work because the API docs are built with automodule, see the comment by @phackstock: #77 (comment)
  • The auto-docs of the pydantic-based classes show a lot of functions that are not relevant to the users
  • Having the type-hints in the function signature makes for pretty-much-useless docs, see below (from here)

(screenshot of the rendered API docs omitted)

Order of validation & aggregation

This issue is in response to openENTRANCE/openentrance#130.

There is an issue that can occur related to the order of operations for validation and region aggregation.
The way these two steps were originally supposed to be run is like this:

dsd.validate(df)
df = rp.apply(df)

That brings a problem because df might contain regions that are not in the DataStructureDefinition. That might not actually be a problem, though, as these regions might be dropped or renamed as part of the region processing.
This led to the following workaround:

# only validate the variables (or any other non region dimensions) at this point
dsd.validate(df, dimensions=["variable"]) 
df = rp.apply(df)

This works because the model mapping is checked in the region processor upon instantiation, so any illegal regions are going to be detected. There is, however, one case in which this falls short: if we have a model for which no model mapping is defined, nothing is checked, so we would still need to validate the resulting data frame explicitly.

# only validate the variables (or any other non region dimensions) at this point
dsd.validate(df, dimensions=["variable"]) 
df = rp.apply(df)
dsd.validate(df, dimensions=["region"]) 

This is not the cleanest implementation and can easily lead to spurious errors or to illegal regions ending up in the database.

The crux of the problem is that the question whether or not the DataStructureDefinition should validate the regions of the incoming df directly hinges on whether or not a model mapping is defined for the data inside df:

  • If there is a mapping, the RegionProcessor has already taken care of this step so the incoming regions must not be validated.
  • If there is no mapping, the regions must be validated by DataStructureDefinition as there is no other validation step.

The information about whether a model mapping exists is only available to the RegionProcessor, not the DataStructureDefinition.

Therefore I would propose two different solutions:

  1. Move the validation functionality also inside RegionProcessor. By creating RegionProcessor.validate() we can iterate through the models inside the dataframe and decide on a per model basis whether or not we validate the regions. This would give something like this:
rp.validate(df)
df = rp.apply(df)
  2. Alternatively, we can make region validation part of RegionProcessor.apply(). This would take minimal effort as there is already an if clause in RegionProcessor.apply() that checks whether or not a model mapping is present. As of right now it only logs the fact that there is none and then returns the data unchanged. Adding region validation here would be a single line. With this, the validation process would look like this:
dsd.validate(df) # per default anything except regions
df = rp.apply(df) # apply region check if no model mapping is found
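The per-model decision in both proposals boils down to the same rule, sketched here with plain dicts standing in for the IamDataFrame and the mapping collection (names are illustrative):

```python
def apply_with_region_check(regions_by_model, mappings, allowed_regions):
    """Sketch of the per-model rule: regions_by_model maps each model in the
    dataframe to the regions it reports. Models with a mapping are assumed to
    be handled by the RegionProcessor; all others are validated here."""
    for model, regions in regions_by_model.items():
        if model in mappings:
            continue  # mapping exists -> regions already checked on instantiation
        invalid = set(regions) - set(allowed_regions)
        if invalid:
            raise ValueError(f"Invalid regions for model {model}: {sorted(invalid)}")
```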

What are your thoughts @danielhuppmann? I hope I got the description of the problem correct.

Check if resulting IamDataFrame is empty

In processor.region.RegionProcessor.apply we currently check if the list processed_dfs is empty and raise an error if it is.
In addition, we need to check whether the individual dataframes contain any data.
If all of them are empty, we raise an error.
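A minimal sketch of the combined check, with processed_dfs as a list of objects supporting len():

```python
def check_processed_dfs(processed_dfs):
    """Raise if no dataframes were produced or if all of them are empty."""
    if not processed_dfs or all(len(df) == 0 for df in processed_dfs):
        raise ValueError("Region processing produced an empty IamDataFrame")
    return processed_dfs
```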

Partial region-aggregation for common-regions

There is a use case where part of the timeseries data for a region is included in the IamDataFrame (e.g., scenario data uploaded to the IIASA Scenario Explorer), and another part of the data is aggregated from the sub-regions (as is currently done by the region processor).

A simple solution would be to keep any timeseries data for common-regions and only compute the aggregate-region-data for other variables.

Questions to be discussed:

  • Should we check for conflicts? e.g., a value provided at "World" level is different from the value computed by the RegionProcessor... Should this be an error or warning?
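The "keep provided data, aggregate the rest" rule could be sketched like this, with timeseries as plain {(variable, year): value} dicts for one common region (illustrative, not the region processor's internals):

```python
import logging

logger = logging.getLogger(__name__)


def merge_partial(provided, aggregated, atol=1e-8):
    """Prefer values provided directly for the common region; fill gaps
    from the aggregate; warn when both exist and disagree."""
    merged = dict(aggregated)
    for key, value in provided.items():
        if key in aggregated and abs(aggregated[key] - value) > atol:
            logger.warning(
                f"Conflict for {key}: provided={value}, aggregated={aggregated[key]}"
            )
        merged[key] = value  # the provided value wins
    return merged
```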

Restructuring the documentation

The current documentation sections "Getting Started" and "Usage" are not as useful or easy-to-understand as they should be.

Here is a suggestion for a new structure:

  1. Remove "Installation" as own page
  2. Change "Getting started" to have the following sections:
    • Installation: what previously was on the "Installation" page
    • Project-specific configuration: explain that each project has a configuration folder usually hosted on Github, explain that a user has to download/clone the GitHub repo to run the validation for a specific project locally
    • Data validation: One paragraph showing the CLI for nomenclature validate-project .
  3. Change "Usage" to have the following sub-pages
    • Folder structure
    • Definitions
    • Mappings for region processing

The Usage pages should be aimed at novice Python users, avoiding Python jargon. Also, "Notes" is probably not an ideal header, because it does not tell a user what kind of notes to expect... Better to have descriptive sub-section titles.

Add file name to errors in RegionAggregationMapping

Currently, if we have an error validating a RegionAggregationMapping, we only get the cause of the error but not the file where that mapping is saved. With a large number of mapping files, searching for the correct one can become tedious, especially when thinking about further validation steps, i.e. checking for model collisions and validating against a DataStructureDefinition.

Proposed fix

Add the attribute mapping_file to RegionAggregationMapping and update the involved methods and tests.
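The proposed fix could be approximated by wrapping the parsing step; parse_mapping here is a stand-in for the pydantic-based parser, not the actual implementation:

```python
from pathlib import Path


def parse_mapping(file):
    # stand-in for the pydantic-based parsing, which raises on invalid input
    raise ValueError("model\n  field required")


def load_mapping(file):
    """Hypothetical wrapper: attach the offending file name to any
    validation error so it is easy to locate among many mapping files."""
    try:
        return parse_mapping(file)
    except ValueError as e:
        raise ValueError(f"In {Path(file).name}: {e}") from e
```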

What are your thoughts on that @danielhuppmann?

Check for allowed regions

Building on #22, the next step should be to integrate RegionAggregationMapping with DataStructureDefinition in order to verify that all specified native regions as well as common regions are allowed.

Region-Mapping for scenario explorer workflows

Opening the issue per discussion with @danielhuppmann.

We want a function which can do region aggregation given two inputs: df: IamDataFrame and mapping_path: pathlib.Path. It should return the aggregated IamDataFrame.

Some open points:

  • Should the returned dataframe only contain the aggregated regions or also the original constituent regions?
  • Should the function make use of the region aggregation capabilities of IamDataFrame.rename() or IamDataFrame.aggregate_region()
  • How do we perform region aggregation for variables where a simple sum is not the correct way to aggregate?

Start proper docs

Set up a proper Sphinx-docs structure with the following sections:

  • Overview and scope (pointing to openENTRANCE as the first use case, referencing pyam as the main dependency, with openENTRANCE and Horizon 2020 acknowledgement as in the current README)
  • Usage (focus on the yaml format)
    • Code lists (region, variable with units, tags, generic)
    • Region mappings
  • API documentation
  • CLI documentation

Improve aggregation comparison

A use case that emerged from NGFS scenario submissions (but also applies to any other scenario comparison exercise) is to improve the usability of the comparison feature that is part of partial region aggregation.
A typical comparison output might look like this, taken from https://data.ece.iiasa.ac.at/ngfs-internal/#/uploads/jobs/189/log:

                                                                                                 common-region  aggregation
model                       scenario region variable                            unit       year
MESSAGEix-GLOBIOM 1.1-M-R12 h_cpol   World  Capital Cost|Electricity|Geothermal US$2010/kW 2020       0.000000  4535.461014
                                                                                           2025       0.000000  4141.136461
                                                                                           2030       0.000000  3748.030892
                                                                                           2035       0.000000  3487.297166
                                                                                           2040       0.000000  3240.241874
...                                                                                                        ...          ...
                            o_2c     World  Yield|Sugarcrops                    t DM/ha/yr 2070      19.216144    18.438459
                                                                                           2080      19.955667    19.141642
                                                                                           2090      20.586738    19.947336
                                                                                           2100      20.888695    20.636481
                                                                                           2110      20.549795    20.567953

[3876 rows x 2 columns]

which is a good start but not yet very useful, as it leaves out a lot of information.
Some of the things that would be useful are:

  • Full list of all model, scenario, variable combinations where there are differences.
  • Option to export the result which is a pandas.DataFrame obtained with pyam.compare (https://pyam-iamc.readthedocs.io/en/stable/api/general.html).
  • Add a column which gives the relative difference. This might be important as IAM data can span different units and orders of magnitude. This should be implemented as a pyam feature though.
  • Sort the resulting data frame by the relative difference.
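The relative-difference column and sorting could be added on top of the comparison DataFrame; this is a sketch, with column names taken from the output above:

```python
import pandas as pd


def add_relative_difference(comparison: pd.DataFrame) -> pd.DataFrame:
    """Append a relative-difference column and sort by it, largest first.
    Note: a zero 'common-region' value yields inf here, which would need
    special handling in a real implementation."""
    out = comparison.copy()
    out["rel_diff"] = (
        (out["aggregation"] - out["common-region"]).abs() / out["common-region"].abs()
    )
    return out.sort_values("rel_diff", ascending=False)
```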

Add Zenodo entry

We want to have a zenodo entry for the nomenclature package.
This should update itself for every new release.

Automate release

For future releases of the nomenclature package on pypi we want a GitHub action that automatically updates the pypi entry.

Inspiration may be taken from the publish.yaml workflow from the units package.

Make aggregation region validation optional

As I was updating the documentation for #74 I found a possible corner case that we are currently not covering.

Let's assume the following:

  1. We have a DataStructureDefinition with a region dimension.
  2. We use a RegionProcessor.
  3. We do not want to validate the region dimension, so we provide a custom dimension list to process.

As part of RegionProcessor.apply we currently check that the renaming and aggregation complies with what's given in the DataStructureDefinition. This means that even though we might provide a custom dimension list that does not contain region, RegionProcessor.apply will still validate the regions. This could mean potentially throwing errors if there are conflicts in the region mapping and the DataStructureDefinition.
My question now is, should I add a validate_region parameter to RegionProcessor.apply(), which determined whether or not region validation is performed before renaming and aggregating? I would implement it so that it defaults to true and would be set to false in process if region is not in dimension.
What are your thoughts @danielhuppmann?
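The proposed flow would look roughly like this; the signatures are illustrative, not the actual API:

```python
def process(df, dsd, rp, dimensions):
    """Sketch of the proposal: only ask the RegionProcessor to validate
    regions when 'region' is among the requested dimensions."""
    dsd.validate(df, dimensions=dimensions)
    return rp.apply(df, validate_region="region" in dimensions)
```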

Migrate from IIASA-internal validation prototype

This issue collects useful utility scripts for migrating a scenario-processing repository related to an IIASA Scenario Explorer. I started developing scripts to translate the prototypes to the nomenclature package and am creating this issue for future reference.

Variable template

import yaml

with open("<old-file>.yml", "r") as stream:
    variable_config = yaml.load(stream, Loader=yaml.FullLoader)

stream = yaml.dump(
    [{code: attrs} for code, attrs in variable_config.items()]
)

with open("<new-file>.yaml", "w") as file:
    file.write(stream.replace(": .nan\n", ":\n"))

Region mappings

import yaml
from pathlib import Path
from typing import Union, Dict
from nomenclature import (
    DataStructureDefinition,
    RegionProcessor,
    RegionAggregationMapping,
)


def convert_mapping(file: Union[str, Path]) -> Dict:

    with open(file, "r") as f:
        mapping = yaml.safe_load(f)
    mapping["native_regions"] = [
        {key: f"{mapping['model']}|{value}"} if value != "World" else {key: value}
        for key, value in mapping["native_regions"].items()
    ]
    mapping["common_regions"] = [
        {key: value} for key, value in mapping["region_aggregation"].items()
    ]
    del mapping["region_aggregation"]
    return mapping


def convert_to_model_native_region_definition():
    (Path(__file__).parent / "definitions/region/model_native_regions/").mkdir(
        exist_ok=True
    )
    for file in (Path(__file__).parent / "mappings").iterdir():
        mapping = RegionAggregationMapping.from_file(file)
        model_native_region = [
            {
                mapping.model: [
                    nr.target_native_region
                    for nr in mapping.native_regions
                    if nr.target_native_region != "World"
                ]
            }
        ]
        with open(
            Path(__file__).parent
            / "definitions/region/model_native_regions/"
            / file.name,
            "w",
        ) as f:
            yaml.dump(model_native_region, f, indent=2)


def convert_complete_mapping_dir(
    current_mapping_dir: Path = Path(__file__).parent / "region_mapping",
) -> None:
    Path(Path(__file__).parent / "mappings").mkdir(exist_ok=True)

    for file in current_mapping_dir.iterdir():
        new_mapping = convert_mapping(file)
        with open(file.parents[1] / f"mappings/{file.stem}.yaml", "w") as f:
            yaml.dump(new_mapping, f, indent=2, sort_keys=False)
    rp = RegionProcessor.from_directory("mappings").validate_mappings(
        DataStructureDefinition("definitions", dimensions=["region"])
    )


if __name__ == "__main__":

    convert_to_model_native_region_definition()

Perform aggregation and rename

In the ECEMF project, we came across the issue where several region-aggregation configurations for the same variable should be performed. The particular use case: for carbon prices, it can be beneficial to compute weighted averages using alternative weights:

  • the "obvious" solution to use CO2 emissions as weights, but this can lead to strange results if weights go negative (see IAMconsortium/pyam#446)
  • other weights like Final Energy or GDP, which cannot go negative

When using alternative weights, it would be helpful to "remember" the alternative weight (and make it explicit) by changing the variable name, so when region-aggregating the "Price|Carbon" variable with "Final Energy" as weight, the resulting timeseries variable could be renamed to "Price|Carbon (weighted by Final Energy)".

One solution to implement this would be to add a "rename" argument to the variable attributes, e.g.

- Price|Carbon:
    description: Price of carbon (for regional aggregates the weighted price of carbon
        by subregion should be used)
    unit: [EUR_2020/t CO2, USD_2010/t CO2]
    weight: Final Energy
    rename: Price|Carbon (weighted by Final Energy)
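Applying the rename after aggregation could be sketched as follows, with variable attributes and results as plain dicts (names illustrative):

```python
def rename_aggregated(results, attrs):
    """Sketch of the proposed 'rename' attribute: after region-aggregating,
    relabel any variable whose definition carries a 'rename' entry."""
    return {attrs.get(v, {}).get("rename", v): data for v, data in results.items()}
```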

Double-check package config

When installing the package (editable) in a new conda environment, I see the following in the list of installed packages:

nomenclature       0.1.dev14+g7aa3adf
nomenclature-iamc  0.2.dev15+gef3a1ec.d20211217 c:\users\huppmann\github\nomenclature

Seems like something is incorrect with setup.cfg?

Performance improvement for region-aggregation processing

The current implementation iterates over all common-regions, then creates the variable-kwargs-dictionary, and then iterates over each variable.

Two ways to significantly improve performance:

  1. create variable-kwargs-dictionary before iterating over common regions
  2. The pyam aggregate_region() method can take a list of variables if there are no additional arguments (weight, method, ...). So the variable-kwargs-dictionary could be split into a "summed variables" list plus an "other-method variables" dictionary.
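The split in point 2 could be sketched as:

```python
def split_variables(variable_kwargs):
    """Split the variable-kwargs dictionary into a list of plainly summed
    variables (which fit into one aggregate_region() call) and a dict of
    variables that need special arguments (weight, method, ...)."""
    summed = [v for v, kwargs in variable_kwargs.items() if not kwargs]
    special = {v: kwargs for v, kwargs in variable_kwargs.items() if kwargs}
    return summed, special
```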

Add Logging to region processing

As of right now there is no logging happening in the context of region processing.
This should be changed prior to the first release.

Improve error reporting

As discussed in #45 we should improve the error handling in nomenclature/validation.py:validate.

Currently we report a generic error and all the information about the invalid values is only given in the logging.
The advantage of this pattern is that we run through all validation steps so in a single go we get the complete picture of all invalid values. If we were to throw individual errors we would only get them one by one.
Therefore, any solution should keep this pattern where we collect all errors and only raise a single one at the end of the validation.
Inspiration can be taken from pydantic where all errors are collected and then handed over to an error handler ValidationError. ValidationError is an error itself that acts as a container class for all individual errors.

Relative definition of path in Nomenclature

The relative definition of the path variable in the __init__ of Nomenclature, def __init__(self, path="definitions"), can lead to problems when the workflow using it is not started from the parent directory of definitions.

One possible fix would be to remove the default and require it to be set explicitly in every workflow in the following pattern:

Nomenclature(Path(__file__).absolute().parent / "definitions").validate(df)

Another fix would be to check whether the directory specified in path actually exists and throw an error if it doesn't. This might also be a use case for pydantic, which provides a DirectoryPath type that performs this validation automatically (https://pydantic-docs.helpmanual.io/usage/types/#pydantic-types)
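The existence check could be as simple as this sketch:

```python
from pathlib import Path


def resolve_definitions_path(path="definitions"):
    """Raise early if the definitions directory does not exist, instead of
    failing later with a confusing error from the validation workflow."""
    path = Path(path)
    if not path.is_dir():
        raise NotADirectoryError(f"Definitions directory not found: {path}")
    return path
```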

Mismatching region results in empty IamDataFrame after processing

The new implementation of the RegionProcessor.apply() does the following to a submitted scenario (if a model-region-mapping exists):

  • keep list of model-native regions (possibly with renaming)
  • perform aggregation of common-regions from model-native regions

and returns the combination of these two. This means that any regions in the scenario data that are not defined as model-native-regions will be dropped as part of the renaming.

This means that if a scenario is submitted without any "native regions", the region-processing returns an empty IamDataFrame, which results in an IndexError: index 0 is out of bounds for axis 0 with size 0 (see this example), not yet sure why.

Question: how should this be treated as part of a scenario upload via the Scenario Explorer?

  1. RegionProcessor.apply() raises a ValueError if the processed scenario is empty
  2. the ixmp-server-workflow module raises an error if all scenario data is "lost" as part of the project-specific workflow
  3. this is actually fine and a user should see a green status (because the workflow performed as expected: no native-regions, no data)

Personally, I think 3 is wrong.

@peterkolp @phackstock, what are your thoughts?

Validation of datetime format

The openENTRANCE project uses subannual time resolution with the datetime format in two flavors:

  • datetime format as a "time" column
  • datetime format without the year in a "subannual" column plus a "year" column

Read the description in the openentrance repository.

PR openENTRANCE/openentrance#129 added validation of the subannual format to the openentrance workflow. This feature should be migrated to this package once a second project requires support for subannual time resolution.

Allow "filtering by DataStructureDefinition"

The currently implemented validate() method raises an error if any row of an IamDataFrame is not defined in the codelist of a DataStructureDefinition.

There is a secondary use case where we want to filter (downselect) an IamDataFrame by the codelists of an DataStructureDefinition.

This could be done already now by the following:

dsd = nomenclature.DataStructureDefinition("<path/to/definitions>")
df = pyam.IamDataFrame("<path/to/file>")
valid_df = df.filter(region=dsd.region, variable=dsd.variable)

This would not ensure that units are correct, and all dimensions have to be specified explicitly, but it is probably good enough for most use cases.

Still, going forward, it might be beneficial to add a more concise way to filter.

Best option, in my opinion:

df_valid = dsd.filter(df)

Alternative (but might cause circular-dependency issues):

df_valid = df.filter(definition=dsd)
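The dsd.filter option could look roughly like this; it is written as a free function here, and the per-dimension attribute access is an assumption:

```python
def filter_by_definition(df, dsd, dimensions=("region", "variable")):
    """Sketch of the proposed dsd.filter(df): downselect an IamDataFrame by
    the codelists of the DataStructureDefinition, one filter kwarg per
    dimension."""
    return df.filter(**{dim: getattr(dsd, dim) for dim in dimensions})
```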

Add "dimensions" arg to validation CLI

Projects/repositories will use non-default dimensions in their "definitions", so it would be helpful to add this as an optional argument to nomenclature.testing.assert_valid_structure() and also enable passing this via the validate-project CLI.

Related to #68.

Reuse/consider alignment with iTEM / SDMX code

The iTEM package (transport-energy on PyPI) contains a submodule item.structure that has related, but somewhat distinct, goals to this package.

This issue is to explain/discuss those goals and the resulting design, and to identify code or patterns that can be reused (or adapted) from transport-energy to here, or vice versa.

As a numbered list of points, for easier reference below. Will expand as necessary:

  1. Code is here: https://github.com/transportenergy/database/tree/main/item/structure

  2. The code uses the sdmx1 package, implementing the SDMX information model (IM). The concepts of the “IAMC format” and “templates” map to the IM as follows.

  3. The “IAMC format” is a particular data structure, with a fixed list of dimensions (called “columns”).

    • Each of those dimensions is, usually, represented by a list of codes (i.e. there are no "uncoded" or free-entry values for any of the dimensions).
    • The codes for the “Variable” dimension are long strings like “Emissions|CO2|Transport”. These represent a larger number (≥1) of dimensions for distinct concepts, ‘collapsed’ together by formatting strings.
    • In the example above, “CO2” might be a code for a concept/dimension with the ID EMI_SPECIES, “Transport” for dimension SECTOR.
    • “Emissions” corresponds to what the IM calls a measure, i.e. a concept that answers the question: what is being measured? The mass of emissions.
    • So the “Variable” codes (for this quantity only; more below) are formatted by joining codes for these three dimensions in a certain order.
    • I gave a long explanation of this at pik-primap/primap2#19 (comment)
  4. A “template” usually combines ≥2 quantities.

    • Each quantity is for a single measure.
    • There are different conceptual dimensions relevant for each quantity, e.g. EMI_SPECIES is relevant for the “Emissions” measure/quantity but may not be relevant for a “Population” quantity.
    • By collapsing most of the dimensions into “Variable”, data for these quantities can fit into a common structure; the IAMC data structure.
    • Analysis code typically wants to handle each of these quantities separately and often wants to manipulate the full/original set of dimensions.
  5. The item package describes the basic data structures as follows; see base.py at the above link:

    • Define concept schemes, lists of concepts that might be either (a) dimensions or (b) measures for transport data.
    • For each concept that is coded, define codelists (lists of codes) for allowable values. Some of these are hierarchical.
    • For each quantity to be included, define its data structure, i.e. the specific set of dimensions relevant to that concept, e.g. ACTIVITY_VEHICLE.
    • Define constraints: (un)allowable combinations of different codes across 2+ concepts/codelists. For instance, the “LDV” code for the VEHICLE concept/dimension is only valid in combination with the “P” (Passenger) code for the SERVICE concept/dimension; and not with SERVICE=“F” (see here)
  6. Then a template is prepared by the following algorithm:

    • List ≥1 quantities.
    • For each quantity:
      • Obtain the quantity-specific data structure, including dimensions, and corresponding code lists.
      • Attach constraints for the concepts/dimensions in this structure.
      • Generate the Cartesian product of allowable codes for every dimension. This gives a list of the allowable keys for each quantity.
      • Collapse the key values for the non-IAMC concepts/dimensions into a “Variable” code, using a quantity-specific string formatting operation.
    • Concatenate the lists for each quantity.
  7. Outputs of (6) include:

    • A list of valid keys, with full/original dimensionality keys for each quantity.
    • A new codelist for “Variable” in this particular collection of data structures. This maps 1:1 to the above list.

    The latter (or the list of allowable keys for all dimensions of the collapsed/IAMC data structure) can be written to a CSV or Excel file; this is the file commonly provided as a “template”.

    The mapping between the two lists can be stored and used to restore full/original dimensions for each quantity. This obviates any need to parse “Variable” codes that might be idiosyncratic; the operation is simply a lookup, and data is invalid if the lookup fails.

    The data structures for each quantity can be used to validate data.
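The template-generation algorithm in (6) can be sketched in a few lines of Python. The dimension names, code lists, and constraint below are hypothetical placeholders, not the actual item code lists:

```python
from itertools import product

# Hypothetical code lists for the dimensions of an "Emissions" quantity
codelists = {
    "MEASURE": ["Emissions"],
    "EMI_SPECIES": ["CO2", "CH4"],
    "SECTOR": ["Transport", "Industry"],
}

# A constraint: only these (species, sector) combinations are allowed
allowed = {("CO2", "Transport"), ("CO2", "Industry"), ("CH4", "Transport")}

def iamc_variables():
    """Generate "Variable" codes: take the Cartesian product of the code
    lists, filter by the constraint, and collapse the keys with "|"."""
    for measure, species, sector in product(*codelists.values()):
        if (species, sector) in allowed:
            yield f"{measure}|{species}|{sector}"

print(sorted(iamc_variables()))
# ['Emissions|CH4|Transport', 'Emissions|CO2|Industry', 'Emissions|CO2|Transport']
```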

Implement processor to "undo" region processing

It can be useful/required to "undo" or reverse the renaming done as part of the region-processing, i.e., to take the model-native data with the region names used in the ensemble and rename them back to the region names as originally used by the model.

My hunch is that this would be better implemented as a new type of processor (instead of an argument of RegionProcessor.apply() or a new method there), so that we can easily call a list of processors using

process(IamDataFrame, DataStructureDefinition, processor=[ReverseRegionProcessor, SomeOtherProcessor])

Question to be discussed

Whether to drop or keep common regions as part of the operation, and (if that would be optional) how to configure/specify this...
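A minimal sketch of the reverse-renaming step, assuming the mapping is available as a plain dict (the actual RegionProcessor internals may differ):

```python
def reverse_rename(regions, native_to_ensemble):
    """Rename ensemble region names back to the model-native names.

    `native_to_ensemble` maps original model-native names to the names
    used in the ensemble, e.g. {"AUT": "OSeMBE v1|AUT"}.
    """
    ensemble_to_native = {v: k for k, v in native_to_ensemble.items()}
    # Regions without an entry (e.g. common regions) are left unchanged;
    # whether to drop them instead is the open question discussed above.
    return [ensemble_to_native.get(r, r) for r in regions]
```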

Revise name for the main class

When migrating this work from the prototype developed in https://github.com/openENTRANCE/nomenclature, I kept the name of the main class Nomenclature. However, nomenclature.Nomenclature is not an ideal name...

So... what is a better name for the class that holds several CodeLists, i.e., one list of allowed values per dimension (variable including units, region, ...) to be used in a particular project (e.g., the AR6 WG3 scenario ensemble).

  • Template - in the IAMC context, we frequently use the term "template" for the list of variables/regions used in a project, see also @khaeru's discussion in #10
  • Project - to highlight that this instance holds the naming conventions of a particular project
  • Registry - following the example of https://github.com/sentinel-energy/friendly_data_registry

Raise all duplicate-errors during initialization

Currently, only a single duplicate-code error is raised during initialization (e.g., from validate-project): processing stops at the first duplicate found. It would be great to collect all duplicates and raise them as one pydantic error collection.
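A sketch of the intended collect-then-raise behavior, using a plain ValueError rather than the actual pydantic error classes:

```python
from collections import Counter

def check_duplicates(codes):
    """Collect *all* duplicate codes and raise a single error,
    instead of failing on the first duplicate found."""
    duplicates = [code for code, n in Counter(codes).items() if n > 1]
    if duplicates:
        raise ValueError(f"Duplicate codes: {duplicates}")
```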

Allow single value input for DataStructureDefinition

When creating a new DataStructureDefinition object, it would be nice to specify a single dimension as a plain string rather than as a list containing only that string, i.e. instead of:

dsd = DataStructureDefinition(..., dimensions=["region"])

it would be nice to do this:

dsd = DataStructureDefinition(..., dimensions="region")

The same goes for DataStructureDefinition().validate(df, dimensions=['region']).
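Supporting both forms would be a small normalization step; a hypothetical helper:

```python
def to_list(value):
    """Wrap a single string into a list; pass other iterables through as a list."""
    return [value] if isinstance(value, str) else list(value)
```

Applied to the dimensions argument, both dimensions="region" and dimensions=["region"] would then yield the same list.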

Create custom GitHub action for validating projects

As a number of projects use the nomenclature package in the form of a GitHub action to validate their data structure definitions and region mappings, it would be nice to add a GitHub action to this repository.
This action can then be used directly in the various projects.

This is what this action might look like:

# This workflow validates that the project is consistent with the nomenclature
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Validate the project

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ '**' ]

jobs:
  project-validation:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2

    - name: Set up Python 3.9
      uses: actions/setup-python@v1
      with:
        python-version: 3.9

    - name: Install requirements
      run: pip install -r requirements.txt

    - name: Run the nomenclature project validation
      run: nomenclature validate-project .

Support *required* variables and regions

Per a discussion with @znicholls and @lewisjared.

The validation currently only checks that variables/regions/... of a scenario (as an IamDataFrame) are defined in the DataStructureDefinition. We have use cases for the opposite direction: whether data for a list of required variables/regions exists in the scenario.

Related: pyam has a method require_variable() - but this only checks that data for that variable exists for any region/year...

There are several possible "levels":

  1. Require that (at least some data for) all variables and regions exist in the IamDataFrame.
  2. Add an attribute required to the variable attributes that receive special treatment:
    - Allowed variable name:
        description: A short explanation or definition
        unit: A unit
        required: true
        <other attribute>: Some text, value, boolean or list (optional)
    
  3. Allow a dictionary as value for the required attribute for a more detailed configuration of "required", e.g., for specific regions or years...
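The level-1 check could be as simple as the following sketch, operating on plain iterables of names rather than an IamDataFrame:

```python
def missing_required(required, present):
    """Return the required variables/regions for which no data exists.

    `required` and `present` are iterables of names; an empty result
    means the scenario passes the check.
    """
    return sorted(set(required) - set(present))
```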

Weighted region aggregation fails when weight is missing

When performing weighted region aggregation, it must be ensured that both the variable and the weight are present in all model-scenario combinations.
From my point of view, it would be desirable to produce a warning if the weight and/or variable are missing for only some model-scenario combinations, and an error if they are missing for all of them.
Maybe this is better addressed on the pyam side, though.
What are your thoughts on the matter @danielhuppmann?
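A sketch of the warn-vs-error logic described above, using a dict from (model, scenario) to the set of available variables as an assumed input (the real check would inspect the IamDataFrame):

```python
import logging

logger = logging.getLogger(__name__)

def check_weight_available(available, variable, weight):
    """Warn if variable/weight are missing for some (model, scenario)
    combinations; raise if they are missing for all of them."""
    missing = [
        key for key, variables in available.items()
        if variable not in variables or weight not in variables
    ]
    if missing and len(missing) == len(available):
        raise ValueError(f"'{variable}'/'{weight}' missing in all combinations")
    if missing:
        logger.warning(f"'{variable}'/'{weight}' missing for: {missing}")
```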

Use schema validation for nomenclature definitions

Seeing the nice implementation by @phackstock of the model-region-configuration parsing in #22 gave me two ideas how to improve the structure for variables and regions:

  1. Implement a schema validation of the yaml files
  2. Change the structure of the definitions files from a simple dictionary
Primary Energy:
  definition: Total primary energy consumption
  unit: EJ/yr

to a list of dictionaries:

- Primary Energy:
    definition: Total primary energy consumption
    unit: EJ/yr

This would safeguard against duplicate entries within one file (currently, all identical entries except the last one are simply ignored).
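With the list-of-dictionaries layout, duplicates survive yaml parsing as separate list entries and can be detected explicitly when merging; a sketch, assuming the parsed input is a list of single-key dicts:

```python
def merge_codelist(entries):
    """Merge a list of single-key dicts (as yaml returns the list layout)
    into one dict, raising on duplicate code names instead of silently
    keeping only the last entry."""
    merged = {}
    for entry in entries:
        for code, attrs in entry.items():
            if code in merged:
                raise ValueError(f"Duplicate code: {code}")
            merged[code] = attrs
    return merged
```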

Raise error if region processing returns empty result

As it currently stands, the region processor assembles the final result of the processing by concatenating a list of IamDataFrame objects stored in processed_dfs (https://github.com/IAMconsortium/nomenclature/blob/main/nomenclature/processor/region.py#L447).
Before that, it is checked that the list processed_dfs is not empty; however, this check only verifies that the list contains at least one IamDataFrame object, which is not sufficient since such an object can itself be empty.
In the extreme case, processed_dfs contains only empty IamDataFrame objects, which raises no error but produces an unexpected result for the user during upload: the upload appears 'green' in the Scenario Explorer UI while no data is added.

Adding a check as simple as:

if all(df.empty for df in processed_dfs):
    raise ValueError("...")

should resolve the issue.

Refactor tags to f-string style

The current use of <Tag> could be changed to {Tag} to be closer to Python f-string formatting.

Given that there are currently only two or three repositories using tags, this should be a manageable task to implement the change and update the repos simultaneously.

Thoughts @phackstock?
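Migrating existing definition files could be automated with a simple substitution; a sketch assuming tag names never contain angle brackets:

```python
import re

def angle_to_brace(text):
    """Rewrite <Tag> placeholders to f-string style {Tag}."""
    return re.sub(r"<([^<>]+)>", r"{\1}", text)
```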

Allow multiple "model" values in a region-mapping file

The "model" attribute in a region-processing mapping file currently only takes a string value. It would be beneficial to allow a list, so that one region mapping can be used for several (minor/patch) versions of a model.

So the extended use of the region-mapping file could look like the following (based on openENTRANCE/openentrance#148):

model:
  - OSeMBE v1.0.0
  - OSeMBE v1.1.0
  - OSeMBE v1.2.0
native_regions:
  - AUT: OSeMBE v1|AUT
  ...

common_regions:
  - EU27:
    - AUT
    ...

or

model:
  - OSeMBE v1.0.0
  - OSeMBE v1.0.1
  - OSeMBE v1.0.2
native_regions:
  - AUT: OSeMBE v1.0|AUT
  ...

common_regions:
  - EU27:
    - AUT
    ...

depending on the likelihood of region-changes over the model versions

Common region completeness check

When writing tests for #99 I accidentally discovered a possible bug (or at the very least unexpected behavior) in the current implementation of region aggregation.

Issue outline

The bug occurs as follows:

Suppose we have a model mapping:

model: m_a
common_regions:
  - common_region_A:
    - region_A
    - region_B

If we now upload data that only contains region_A and run the aggregation, it will happily spit out an aggregated value for common_region_A. As region_B is missing from the input, however, the values for common_region_A will be identical to the ones for region_A.

Proposed fix

As we perform region aggregation we could check that for every common region we have data for all constituent regions.
One open question is what to do if a constituent region is missing, as in the example above: should we raise an error or write a warning to the log? My preference would be an error, as the result described above is unexpected and could lead to wrong conclusions.
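The proposed check could look like the following sketch, comparing the constituent regions of a common region against the regions present in the uploaded data:

```python
def check_complete(common_region, constituents, regions_in_data):
    """Raise if any constituent region of a common region is missing
    from the uploaded data, so that aggregation never silently degrades
    into a copy of the available subset."""
    missing = sorted(set(constituents) - set(regions_in_data))
    if missing:
        raise ValueError(
            f"Missing constituent regions for '{common_region}': {missing}"
        )
```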

Implement a testing-module CLI

The nomenclature package already has a testing module (see here), which is currently used as part of the automated testing of the irp-internal-workflow repository (see here).

The aim of this testing module is that in any repository which uses the nomenclature package, we can easily set up a GitHub Actions workflow to ensure that all yaml files (which comprise the definitions of variables and regions) can be correctly parsed and yield a valid DataStructureDefinition.

It would be great to implement this as a CLI, so that rather than having to write the (same) tests and set up a GitHub Actions workflow for them, we can just write a workflow that calls the CLI. More importantly, we can then simply add more tests to the nomenclature-testing-CLI and don't have to remember to add individual tests to each scenario-explorer-workflow repository.

As a starting point, the following would be useful:

  • Implement a CLI entrypoint: nomenclature-test [--dir=<path/to/directory>]
  • The CLI runs the two existing tests in the module:
    • Check that all files in the (sub)directory are parseable as yaml files
    • If the directory has a definitions folder, try to initialize a DataStructureDefinition instance from that folder (if no error is raised, the test passes)

PS: This issue came out of a discussion with @phackstock, see item 3 in #26 (comment)
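A stdlib-only sketch of the entrypoint; the actual implementation would delegate to the existing testing module and a yaml parser, and the function names here are hypothetical:

```python
import argparse
from pathlib import Path

def collect_yaml_files(directory):
    """Return all yaml files in the directory and its subdirectories."""
    return sorted(Path(directory).glob("**/*.yaml"))

def main(argv=None):
    parser = argparse.ArgumentParser(prog="nomenclature-test")
    parser.add_argument("--dir", default=".", help="path to the project directory")
    args = parser.parse_args(argv)
    for path in collect_yaml_files(args.dir):
        # hypothetical hook: parse each file here, failing on the first error
        print(f"checking {path}")
    # if args.dir contains a `definitions` folder, a DataStructureDefinition
    # instance would be initialized from it here
```

Registered as a console-script entrypoint, this would be invoked as nomenclature-test --dir=path/to/directory in a workflow step.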
