iamconsortium / nomenclature
A package to work with IAMC-style variable templates
Home Page: https://nomenclature-iamc.readthedocs.io/
License: Apache License 2.0
When writing tests for #99 I accidentally discovered a possible bug (or at the very least unexpected behavior) in the current implementation of region aggregation.
The bug occurs as follows:
Suppose we have a model mapping:
model: m_a
common_regions:
  - common_region_A:
    - region_A
    - region_B
If we now upload data that only contains region_A and run the aggregation, it will happily spit out an aggregated value for common_region_A. As region_B is missing from the input, however, the values for common_region_A will be identical to the ones for region_A.
As we perform region aggregation we could check that for every common region we have data for all constituent regions.
One open question is what to do if we are missing a constituent region, as in the example above: should we raise an error or write a warning to the log? My preference would be an error, as the result described above is unexpected and could lead to wrong conclusions.
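A minimal sketch of such a check, independent of the nomenclature internals (the function and argument names here are illustrative, not the package API):

```python
def check_constituent_regions(data_regions, common_regions):
    """Raise if any constituent of a common region is missing from the data.

    data_regions: set of region names present in the uploaded data
    common_regions: dict of common-region name -> list of constituent regions
    """
    for common, constituents in common_regions.items():
        missing = [r for r in constituents if r not in data_regions]
        if missing:
            raise ValueError(
                f"Missing constituent regions {missing} for aggregation of '{common}'"
            )
```

For the example above, check_constituent_regions({"region_A"}, {"common_region_A": ["region_A", "region_B"]}) would raise an error naming region_B instead of silently producing an aggregate.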
The nomenclature package already has a testing module (see here), which is currently used as part of the automated testing of the irp-internal-workflow repository (see here).
The aim of this testing module is that in any repository which uses the nomenclature package, we can easily set up a GitHub Actions workflow to ensure that all yaml files (which comprise the definitions of variables and regions) can be correctly parsed and yield a valid DataStructureDefinition.
It would be great to implement this as a CLI, so that rather than having to write the (same) tests and set up a GitHub Actions workflow for them, we can just write a workflow that calls the CLI. More importantly, we can then simply add more tests to the nomenclature-testing-CLI and don't have to remember to add individual tests to each scenario-explorer-workflow repository.
As a starting point, the following would be useful:
nomenclature-test [--dir=<path/to/directory>]: look for a definitions folder, try to initialize a DataStructureDefinition instance from that folder (if no error is raised, the test passes).
PS: This issue came out of a discussion with @phackstock, see item 3 in #26 (comment)
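A rough sketch of the proposed command-line interface, using the stdlib argparse purely for illustration (the actual package may well use a different CLI framework, and the entry-point name is an assumption):

```python
import argparse
from pathlib import Path


def build_parser():
    # hypothetical CLI: one optional --dir pointing at the project root
    parser = argparse.ArgumentParser(prog="nomenclature-test")
    parser.add_argument(
        "--dir", dest="directory", type=Path, default=Path("."),
        help="project directory containing a 'definitions' folder",
    )
    return parser


def run(argv=None):
    args = build_parser().parse_args(argv)
    # the test passes if this does not raise; import deferred so the
    # sketch can be read without nomenclature installed
    from nomenclature import DataStructureDefinition
    DataStructureDefinition(args.directory / "definitions")
```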
As of right now there is no logging happening in the context of region processing.
This should be changed prior to the first release.
A use case that emerged from NGFS scenario submissions (but also applies to any other scenario comparison exercise) is to improve the usability of the comparison feature which is part of the partial regions aggregation.
A typical comparison output might look like this, taken from https://data.ece.iiasa.ac.at/ngfs-internal/#/uploads/jobs/189/log:
common-region aggregation
model scenario region variable unit year
MESSAGEix-GLOBIOM 1.1-M-R12 h_cpol World Capital Cost|Electricity|Geothermal US$2010/kW 2020 0.000000 4535.461014
2025 0.000000 4141.136461
2030 0.000000 3748.030892
2035 0.000000 3487.297166
2040 0.000000 3240.241874
... ... ...
o_2c World Yield|Sugarcrops t DM/ha/yr 2070 19.216144 18.438459
2080 19.955667 19.141642
2090 20.586738 19.947336
2100 20.888695 20.636481
2110 20.549795 20.567953
[3876 rows x 2 columns]
which is a good start but not super useful as it leaves out a lot of information.
Some of the things that would be useful are:
pyam.compare (https://pyam-iamc.readthedocs.io/en/stable/api/general.html).

We want to have a Zenodo entry for the nomenclature package.
This should update itself for every new release.
Following the discussion in #64, it emerged that the CLI currently looks for folders called "mappings" and "definitions" in cli.cli_valid_project. While these two names are a good standard, it should be possible to modify them, as we allow the same when instantiating RegionProcessor and DataStructureDefinition.
Projects/repositories will use non-default dimensions in their "definitions", so it would be helpful to add this as an optional argument to nomenclature.testing.assert_valid_structure() and also enable passing this via the validate-project CLI.
Related to #68.
Currently, if we have an error validating a RegionAggregationMapping, we only get the cause of the error but not the file where that mapping is saved. With a large number of mapping files, searching for the correct one can become a tedious task, especially when thinking about further validation steps, i.e. checking for model collisions and validating against a DataStructureDefinition.
Add the attribute mapping_file to RegionAggregationMapping and update the involved methods and tests.
What are your thoughts on that @danielhuppmann?
There is a use case where part of the timeseries data for a region is included in the IamDataFrame (e.g., scenario data uploaded to the IIASA Scenario Explorer), and the other part of the data is aggregated from the sub-regions (as is currently done by the region processor).
A simple solution would be to keep any timeseries data for common-regions and only compute the aggregate-region-data for other variables.
Questions to be discussed:
In processor.region.RegionProcessor.apply we currently check if the list processed_dfs is empty and raise an error if it is.
In addition to this check, we need to check whether the individual frames contain any data; if all of them are empty, we should raise an error.
As discussed in #45 we should improve the error handling in nomenclature/validation.py:validate.
Currently we report a generic error and all the information about the invalid values is only given in the logging.
The advantage of this pattern is that we run through all validation steps so in a single go we get the complete picture of all invalid values. If we were to throw individual errors we would only get them one by one.
Therefore, any solution should keep this pattern where we collect all errors and only raise a single one at the end of the validation.
Inspiration can be taken from pydantic, where all errors are collected and then handed over to an error handler, ValidationError. ValidationError is itself an error that acts as a container class for all individual errors.
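A condensed sketch of this collect-then-raise pattern (the class and function names are illustrative, not the actual nomenclature API):

```python
class CollectedValidationError(ValueError):
    """Container error holding all individual validation failures."""

    def __init__(self, errors):
        self.errors = errors
        super().__init__(f"{len(errors)} validation error(s): {errors}")


def run_validation(checks, value):
    errors = []
    for check in checks:
        try:
            check(value)
        except ValueError as error:
            # collect instead of raising immediately, so that a single
            # run reports the complete picture of all invalid values
            errors.append(str(error))
    if errors:
        raise CollectedValidationError(errors)
```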
For future releases of the nomenclature package on PyPI, we want a GitHub Action that automatically updates the PyPI entry.
Inspiration may be taken from the publish.yaml workflow from the units package.
There are three issues in the API docs:
nomenclature.RegionProcessor currently does not work because the API docs are built with automodule, see the comment by @phackstock: #77 (comment)

When migrating this work from the prototype developed in https://github.com/openENTRANCE/nomenclature, I kept the name of the main class Nomenclature. However, nomenclature.Nomenclature is not an ideal name...
So... what is a better name for the class that holds several CodeLists, i.e., one list of allowed values per dimension (variable including units, region, ...) to be used in a particular project (e.g., the AR6 WG3 scenario ensemble)?
The currently implemented validate() method raises an error if any row of an IamDataFrame is not defined in the codelists of a DataStructureDefinition.
There is a secondary use case where we want to filter (downselect) an IamDataFrame by the codelists of a DataStructureDefinition.
This could be done already now by the following:
dsd = nomenclature.DataStructureDefinition("<path/to/definitions>")
df = pyam.IamDataFrame("<path/to/file>")
valid_df = df.filter(region=dsd.region, variable=dsd.variable)
This would not ensure that units are correct, and all dimensions have to be specified explicitly, but it is probably good enough for most use cases.
Still, going forward, it might be beneficial to add a more concise way to filter.
Best option, in my opinion:
df_valid = dsd.filter(df)
Alternative (but might cause circular-dependency issues):
df_valid = df.filter(definition=dsd)
When creating a new DataStructureDefinition object, it would be nice to specify a single dimension as a plain string instead of a one-element list, i.e. instead of:
dsd = DataStructureDefinition(..., dimensions=["region"])
it would be nice to do this:
dsd = DataStructureDefinition(..., dimensions="region")
The same goes for DataStructureDefinition().validate(df, dimensions=['region']).
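The usual normalization idiom would cover both call signatures; how exactly this hooks into DataStructureDefinition is an assumption:

```python
def to_list(dimensions):
    """Accept a single dimension name or a list of names."""
    return [dimensions] if isinstance(dimensions, str) else list(dimensions)


assert to_list("region") == ["region"]
assert to_list(["region", "variable"]) == ["region", "variable"]
```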
Seeing the nice implementation by @phackstock of the model-region-configuration parsing in #22 gave me two ideas for how to improve the structure for variables and regions:
Change the current dictionary format:
Primary Energy:
  definition: Total primary energy consumption
  unit: EJ/yr
to a list of dictionaries:
- Primary Energy:
    definition: Total primary energy consumption
    unit: EJ/yr
This would allow safeguarding against duplicate entries within one file (currently, all identical entries except the last one are simply ignored).
Opening the issue per discussion with @danielhuppmann.
We want a function which can do region aggregation given two inputs: df: IamDataFrame and mapping_path: pathlib.Path. It should return the aggregated IamDataFrame.
Some open points:
IamDataFrame.rename() or IamDataFrame.aggregate_region()

The relative definition of the path variable in the __init__ of Nomenclature, def __init__(self, path="definitions"), can lead to problems when the workflow where it is used is not started from the parent directory of definitions.
One possible fix would be to remove the default and require it to be set explicitly in every workflow in the following pattern:
Nomenclature(Path(__file__).absolute().parent / "definitions").validate(df)
Another fix would be checking whether the directory specified in path actually exists and throwing an error if it doesn't. This might also be a use case for pydantic, as it provides a FilePath datatype which does this validation automatically (https://pydantic-docs.helpmanual.io/usage/types/#pydantic-types)
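A minimal sketch of that second fix using only pathlib (pydantic's DirectoryPath type would perform the equivalent check declaratively; the function name here is illustrative):

```python
from pathlib import Path


def resolve_definitions(path="definitions"):
    """Return the definitions directory, raising a clear error if absent."""
    path = Path(path)
    if not path.is_dir():
        raise NotADirectoryError(f"Definitions folder not found: {path.absolute()}")
    return path
```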
Building on #22, the next step should be to integrate RegionAggregationMapping with DataStructureDefinition in order to verify that all specified native regions as well as common regions are allowed.
When refactoring the https://github.com/openENTRANCE/nomenclature to work with this new package, I realized that it would be helpful to have a "skip"-option for variables in a DataStructureDefinition.
Options:
Complicated Variable:
  - region-processor: skip
Complicated Variable:
  - skip-region-processing: true
Or maybe use "exclude" instead of skip?
Any thoughts @phackstock @peterkolp?
The iTEM package (transport-energy on PyPI) contains a submodule item.structure that has related, but somewhat distinct, goals to this package.
This issue is to explain/discuss those goals and the resulting design, and to identify code or patterns that can be reused (or adapted) from transport-energy to here, or vice versa.
As a numbered list of points, for easier reference below. Will expand as necessary:
Code is here: https://github.com/transportenergy/database/tree/main/item/structure
The code uses the sdmx1 package, implementing the SDMX information model (IM). The concepts of the “IAMC format” and “templates” map to the IM as follows.
The “IAMC format” is a particular data structure, with a fixed list of dimensions (called “columns”).
A “template” usually combines ≥2 quantities.
The item package describes the basic data structures as follows; see base.py at the above link:
Then a template is prepared by the following algorithm:
Outputs of (6) include:
The latter (or the list of allowable keys for all dimensions of the collapsed/IAMC data structure) can be written to a CSV or Excel file; this is the file commonly provided as a “template”.
The mapping between the two lists can be stored and used to restore full/original dimensions for each quantity. This obviates any need to parse “Variable” codes that might be idiosyncratic; the operation is simply a lookup, and data is invalid if the lookup fails.
The data structures for each quantity can be used to validate data.
The RegionProcessor currently silently removes all regions that are not native regions (and once PR #99 is merged, common regions).
This can be confusing for users who successfully upload a data file to the Scenario Explorer but then do not find the removed regions in the Explorer.
Question:
Option 1 is more annoying for users, but Option 2 has the caveat that users usually do not look at the log file, in particular if the color is green and some data arrived in the database.
Preferences? @phackstock @Renato-Rodrigues @JohannesEmm @HauHe @robertpietzcker
As it currently stands, the region processor assembles the final result of the processing by concatenating a list of IamDataFrame objects stored in processed_dfs (https://github.com/IAMconsortium/nomenclature/blob/main/nomenclature/processor/region.py#L447).
Before that, it is checked that the list processed_dfs is not empty; however, the current test only checks whether the list contains at least one IamDataFrame object. This is not ideal since the object itself can be empty.
In the extreme case, processed_dfs contains only empty IamDataFrame objects, which does not raise any errors but produces an unexpected result for the user while uploading: the upload appears 'green' in the Scenario Explorer UI while no data is added.
Adding a check as simple as:
if all(df.empty for df in processed_dfs):
    raise ValueError("...")
should resolve the issue.
Currently, only a single duplicate-code error is raised during initialization (e.g., from validate-project). It would be great to collect all duplicates and raise one pydantic-style error collection.
This issue is in response to openENTRANCE/openentrance#130.
There is an issue that can occur related to the order of operations for validation and region aggregation.
The way these two steps were originally supposed to be run is like this:
dsd.validate(df)
df = rp.apply(df)
That brings a problem because df might contain regions that are not in the DataStructureDefinition. That might not be a problem, though, as these regions might be dropped or renamed as part of the region processing.
This brought the following workaround:
# only validate the variables (or any other non region dimensions) at this point
dsd.validate(df, dimensions=["variable"])
df = rp.apply(df)
This works as the model mapping is checked in the region processor upon instantiation, so any illegal regions are going to be detected. There is, however, one case in which this falls short: if we have a model for which no model mapping is defined, nothing is checked, so we would still need to verify the resulting data frame explicitly:
# only validate the variables (or any other non region dimensions) at this point
dsd.validate(df, dimensions=["variable"])
df = rp.apply(df)
dsd.validate(df, dimensions=["region"])
This is not the cleanest implementation and can easily lead to wrong errors or illegal regions in the database.
The crux of the problem is that the question of whether or not the DataStructureDefinition should validate the regions of the incoming df hinges directly on whether or not a model mapping is defined for the data inside df:
- If a mapping exists, the RegionProcessor has already taken care of this step, so the incoming regions must not be validated.
- If not, the regions must be validated by the DataStructureDefinition, as there is no other validation step.
The information about the existence of a model mapping, however, is only available to the RegionProcessor, not the DataStructureDefinition.
Therefore I would propose two different solutions:
1. Add a validation method to the RegionProcessor. By creating RegionProcessor.validate() we can iterate through the models inside the dataframe and decide on a per-model basis whether or not we validate the regions. This would give something like this:
rp.validate(df)
df = rp.apply(df)
2. Add region validation to RegionProcessor.apply(). This would take minimal effort as there is already an if clause in RegionProcessor.apply() that checks whether or not a model mapping is present. As of right now it only logs the fact that there is none and then returns the data unchanged. Adding region validation here would be a single line. With this, the validation process would look like this:
dsd.validate(df)  # per default anything except regions
df = rp.apply(df)  # apply region check if no model mapping is found
What are your thoughts @danielhuppmann? I hope I got the description of the problem correct.
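The per-model decision underlying option 1 boils down to a simple partition of the models in the data; a sketch with illustrative names (not the actual RegionProcessor API):

```python
def models_needing_region_validation(models_in_df, models_with_mapping):
    """Models covered by a mapping are handled by the RegionProcessor;
    only the remaining models need direct region validation."""
    return [m for m in models_in_df if m not in models_with_mapping]


assert models_needing_region_validation(["m_a", "m_b"], {"m_a"}) == ["m_b"]
```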
All items in a codelist have an attribute file to identify from which file the code was imported - this might be helpful for debugging in more complex nomenclatures.
The file name should be saved relative to the definitions folder, not relative to the working directory.
The openENTRANCE project uses subannual time resolution with the datetime format in two flavors:
Read the description in the openentrance repository.
PR openENTRANCE/openentrance#129 added validation of the subannual format to the openentrance workflow. This feature should be migrated to this package once a second project requires support for subannual time resolution.
Title says most of it. Currently, when reading definition files into code lists, whenever we encounter an error we do not report the file where it originated.
This makes troubleshooting more painful than it needs to be.
The new implementation of RegionProcessor.apply() does the following to a submitted scenario (if a model-region-mapping exists):
and returns the combination of these two. This means that any regions in the scenario data that are not defined as model-native-regions will be dropped as part of the renaming.
This means that if a scenario is submitted without any "native regions", the region-processing returns an empty IamDataFrame, which results in an IndexError: index 0 is out of bounds for axis 0 with size 0 (see this example); not yet sure why.
Question: how should this be treated as part of a scenario upload via the Scenario Explorer?
RegionProcessor.apply() raises a ValueError if the processed scenario is empty
The ixmp-server-workflow module raises an error if all scenario data is "lost" as part of the project-specific workflow
Personally, I think 3 is wrong.
@peterkolp @phackstock, what are your thoughts?
This is the result from a conversation with @pweigmann.
The partial region-aggregation feature can lead to unexpected results if a common region is comprised of only a single constituent region, e.g.:
model: model_a
common_regions:
  - Region_A:
    - region_a
This configuration is not super rare, as it can often be the case that a region that a model natively reports, region_a in the example, falls under the category of a comparison region. In this case, no aggregation is necessary; renaming region_a to Region_A is sufficient to ensure comparability with other models.
This is where unexpected behavior can occur, as common_regions in essence means "perform region-aggregation". At the same time, region aggregation works on a principle we call "partial region-aggregation" (https://nomenclature-iamc.readthedocs.io/en/latest/usage.html#partial-region-aggregation), which means that nomenclature will perform the aggregation of the constituent regions but only use it for comparison. The results of the processing will be the model-native results.
However, this only works if the name of the resulting common region, in our case Region_A, remains the same, i.e.:
model: model_a
common_regions:
  - Region_A:
    - Region_A
If we have a renaming, only the aggregated results will be used.
This is normally not a problem as they should be identical to the provided model-native values.
However, if a variable uses weighted aggregation, such as Price|Carbon, and the weight variable contains negative values, we lose data points where the weight is negative.
There are a few options by which this problem can be fixed/circumvented:
1. Disallow single-constituent common regions; renaming would then be done via the native_regions section.

My preference would be option 1, as it is cleaner to implement, the region-aggregation comparison feature does not really make sense if the region is comprised of only a single region, and there is another related issue with the current implementation:
Assuming a model natively reports two regions, "region_a" and "region_b". Assuming further that they should both be renamed to "Region_A" and "Region_B" respectively. Minus the potential issues with missing data, the following model mapping works:
model: model_a
native_regions:
  - region_a: Region_A
common_regions:
  - Region_B:
    - region_b
The inconsistency here is that for native_regions the value on the right is the one that will be in the result; for common_regions (here used for renaming only), on the other hand, the one on the left/above is the desired one.
One way to solve this would be to go with option 1 mentioned above and disallow single-constituent common regions. This would also be in line with the Zen of Python: "There should be one-- and preferably only one --obvious way to do it." In this case the renaming of model-native regions could be done only by adding it to the native_regions list.
If we go with this option we might also have to reconsider the naming of the two categories native_regions and common_regions, as they would then more closely correspond to do-not-aggregate and aggregate.
@danielhuppmann, please let me know what you think.
Also @pweigmann, @orichters and @Renato-Rodrigues (you came to mind first as modellers with current experience using nomenclature; please feel free to loop in other colleagues), your input is most welcome.
When scanning a directory for region-aggregation mappings, only .yaml files are found while .yml files are ignored.
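A sketch of the fix: accept both suffixes when scanning a directory (the actual scanning code in nomenclature may be structured differently):

```python
from pathlib import Path


def mapping_files(directory):
    """Collect mapping files with either .yaml or .yml suffix."""
    return sorted(
        f for f in Path(directory).iterdir() if f.suffix in (".yaml", ".yml")
    )
```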
This issue collects useful utility scripts for migrating a scenario-processing repository related to an IIASA Scenario Explorer. I started developing scripts to translate the prototypes to the nomenclature package and create this issue for future reference.
import yaml

with open("<old-file>.yml", "r") as stream:
    variable_config = yaml.load(stream, Loader=yaml.FullLoader)

stream = yaml.dump(
    [{code: attrs} for code, attrs in variable_config.items()]
)

with open("<new-file>.yaml", "w") as file:
    file.write(stream.replace(": .nan\n", ":\n"))
import yaml
from pathlib import Path
from typing import Union, Dict

from nomenclature import (
    DataStructureDefinition,
    RegionProcessor,
    RegionAggregationMapping,
)


def convert_mapping(file: Union[str, Path]) -> Dict:
    with open(file, "r") as f:
        mapping = yaml.safe_load(f)
    mapping["native_regions"] = [
        {key: f"{mapping['model']}|{value}"} if value != "World" else {key: value}
        for key, value in mapping["native_regions"].items()
    ]
    mapping["common_regions"] = [
        {key: value} for key, value in mapping["region_aggregation"].items()
    ]
    del mapping["region_aggregation"]
    return mapping


def convert_to_model_native_region_definition():
    (Path(__file__).parent / "definitions/region/model_native_regions/").mkdir(
        exist_ok=True
    )
    for file in (Path(__file__).parent / "mappings").iterdir():
        mapping = RegionAggregationMapping.from_file(file)
        model_native_region = [
            {
                mapping.model: [
                    nr.target_native_region
                    for nr in mapping.native_regions
                    if nr.target_native_region != "World"
                ]
            }
        ]
        with open(
            Path(__file__).parent
            / "definitions/region/model_native_regions/"
            / file.name,
            "w",
        ) as f:
            yaml.dump(model_native_region, f, indent=2)


def convert_complete_mapping_dir(
    current_mapping_dir: Path = Path(__file__).parent / "region_mapping",
) -> None:
    Path(Path(__file__).parent / "mappings").mkdir(exist_ok=True)
    for file in current_mapping_dir.iterdir():
        new_mapping = convert_mapping(file)
        with open(file.parents[1] / f"mappings/{file.stem}.yaml", "w") as f:
            yaml.dump(new_mapping, f, indent=2, sort_keys=False)
    rp = RegionProcessor.from_directory("mappings").validate_mappings(
        DataStructureDefinition("definitions", dimensions=["region"])
    )


if __name__ == "__main__":
    convert_to_model_native_region_definition()
In the ECEMF project, we came across the use case that several region-aggregation configurations should be performed for the same variable. In particular, for carbon prices it can be beneficial to compute weighted averages using alternative weights:
When using alternative weights, it would be helpful to "remember" the alternative weight (and make it explicit) by changing the variable name, so when region-aggregating the "Price|Carbon" variable with "Final Energy" as weight, the resulting timeseries variable could be renamed to "Price|Carbon (weighted by Final Energy)".
One solution to implement this would be to add a "rename" argument to the variable attributes, e.g.
- Price|Carbon:
    description: Price of carbon (for regional aggregates the weighted price of
      carbon by subregion should be used)
    unit: [EUR_2020/t CO2, USD_2010/t CO2]
    weight: Final Energy
    rename: Price|Carbon (weighted by Final Energy)
When installing the package (editable) in a new conda environment, I see the following in the list of installed packages:
nomenclature 0.1.dev14+g7aa3adf
nomenclature-iamc 0.2.dev15+gef3a1ec.d20211217 c:\users\huppmann\github\nomenclature
Seems like something is incorrect with setup.cfg?
Per a discussion with @znicholls and @lewisjared.
The validation currently only checks that variables/regions/... of a scenario (as an IamDataFrame) are defined in the DataStructureDefinition. We have use cases for the opposite direction: checking whether data for a list of required variables/regions exists in the scenario.
Related: pyam has a method require_variable() - but this only checks that data for that variable exists for any region/year...
There are several possible "levels":
1. Add an attribute required to the variable attributes that have special treatment:
- Allowed variable name:
    description: A short explanation or definition
    unit: A unit
    required: true
    <other attribute>: Some text, value, boolean or list (optional)
2. Use a dict-style required attribute for a more detailed configuration of "required", e.g., for specific regions or years...

The "model" attribute in a region-processing mapping-file currently only takes a string value. It would be beneficial to allow this to be a list so that one region-mapping can be used for several (minor/patch) versions of a model.
So the extended use of the region-mapping file could look like the following (based on openENTRANCE/openentrance#148):
model:
  - OSeMBE v1.0.0
  - OSeMBE v1.1.0
  - OSeMBE v1.2.0
native_regions:
  - AUT: OSeMBE v1|AUT
  ...
common_regions:
  - EU27:
    - AUT
  ...
or
model:
  - OSeMBE v1.0.0
  - OSeMBE v1.0.1
  - OSeMBE v1.0.2
native_regions:
  - AUT: OSeMBE v1.0|AUT
  ...
common_regions:
  - EU27:
    - AUT
  ...
depending on the likelihood of region-changes over the model versions
As I was updating the documentation for #74 I found a possible corner case that we are currently not covering.
Let's assume the following:
- a DataStructureDefinition with a region dimension
- a RegionProcessor
- a call to process
As part of RegionProcessor.apply we currently check that the renaming and aggregation comply with what's given in the DataStructureDefinition. This means that even though we might provide a custom dimension list that does not contain region, RegionProcessor.apply will still validate the regions. This could mean potentially throwing errors if there are conflicts between the region mapping and the DataStructureDefinition.
My question now is: should I add a validate_region parameter to RegionProcessor.apply(), which determines whether or not region validation is performed before renaming and aggregating? I would implement it so that it defaults to true and is set to false in process if region is not in the dimensions list.
What are your thoughts @danielhuppmann?
The current documentation sections "Getting Started" and "Usage" are not as useful or easy-to-understand as they should be.
Here is a suggestion for a new structure:
nomenclature validate-project .
The Usage pages should be aimed at novice Python users, so avoiding Python jargon. Also, "Notes" is probably not an ideal header, because it does not tell a user what kind of notes to expect... Better to have descriptive sub-section titles.
The current implementation iterates over all common regions, then creates the variable-kwargs dictionary, then iterates over each variable.
Two ways to significantly improve performance:
The aggregate_region() method can take a list of variables if there are no additional arguments (weight, method, ...). So the variable-kwargs dictionary could be split into a "summed variables" list plus an "other-method variables" dictionary.

Building on #22, one of the next steps in verification is to check that there are no collisions within all of the provided mappings.
Illustration of one failure scenario:
Given mapping_1:
name: model_1
native_regions:
  - native_region_a: alternative_name
and mapping_2:
name: model_2
native_regions:
  - alternative_name
Alternatively, it could also be that in mapping_2 we have a region called native_region_b which we end up renaming to alternative_name.
In any case a next step in region mapping could be to check for this.
@danielhuppmann, @peterkolp is this really an issue or is this actually fine? If it is fine, feel free to close the issue. If it is something that should be checked are there any other failure scenarios that should be checked? For the common regions it is of course expected that they share the same name.
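One way to sketch the collision check for native-region renamings across mappings (the input shape is an illustrative simplification; common regions would be excluded from the input, since they are expected to share names):

```python
from collections import defaultdict


def find_renaming_collisions(renamed_regions_by_model):
    """renamed_regions_by_model: dict of model -> list of region names
    produced by its native_regions section (after renaming).
    Returns the names produced by more than one model."""
    producers = defaultdict(list)
    for model, regions in renamed_regions_by_model.items():
        for region in regions:
            producers[region].append(model)
    return {r: models for r, models in producers.items() if len(models) > 1}
```

For the failure scenario above, find_renaming_collisions({"model_1": ["alternative_name"], "model_2": ["alternative_name"]}) would flag alternative_name as being produced by both mappings.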
It can be useful/required to "undo" or reverse the renaming done as part of the region-processing, i.e., taking the model-native-in-ensemble region names and renaming them back to the model-native-as-used-originally region names.
My hunch is that this would be better implemented as a new type of processor (instead of an argument of RegionProcessor.apply() or a new method there), so that we can easily call a list of processors using
process(IamDataFrame, DataStructureDefinition, processor=[ReverseRegionProcessor, SomeOtherProcessor])
An open question is whether to drop or keep common regions as part of the operation, and (if that were optional) how to configure/specify this.
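At its core, reversing the renaming is a dictionary inversion, assuming the original renaming was one-to-one (a sketch of the idea, not the proposed processor API):

```python
def reverse_renaming(native_renaming):
    """Invert a native-region renaming dict so that ensemble names can be
    mapped back to the originally used model-native names."""
    reversed_map = {target: source for source, target in native_renaming.items()}
    if len(reversed_map) != len(native_renaming):
        raise ValueError("Renaming is not one-to-one and cannot be reversed")
    return reversed_map


assert reverse_renaming({"region_a": "Model|Region A"}) == {"Model|Region A": "region_a"}
```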
It is often useful to validate that values of timeseries are within a specified range.
Similar packages implement constraints:
When performing weighted region aggregation, it needs to be ensured that both the variable and the weight are present in all (model, scenario) combinations.
From my point of view, it would be desirable to produce a warning if there are only a couple of (model, scenario) combinations where the weight and/or variable are missing, and an error if that is the case for all of them.
Maybe this is more well suited to be addressed on the pyam side though.
What are your thoughts on the matter @danielhuppmann?
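A sketch of the proposed behavior, operating on sets of (model, scenario) pairs rather than actual IamDataFrames (the function name and input shape are illustrative):

```python
import logging


def check_weight_presence(pairs_with_variable, pairs_with_weight):
    """Warn if the variable/weight is missing for some (model, scenario)
    combinations; raise if it is missing for all of them."""
    complete = pairs_with_variable & pairs_with_weight
    incomplete = (pairs_with_variable | pairs_with_weight) - complete
    if not complete:
        raise ValueError("variable and/or weight missing in all combinations")
    if incomplete:
        logging.warning("variable or weight missing for: %s", sorted(incomplete))
    return complete
```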
Set up a proper Sphinx-docs structure with the following sections:
We should include a check that ensures we only have one mapping per model.
From my point of view, the main question in terms of implementation is how to handle such tasks which relate to an ensemble of mappings. If we only need this one check, a function that creates a dictionary should do, something along those lines:
def read_all_mappings(mapping_dir: Union[str, Path]) -> dict:
    region_mapping_dict = {}
    for mapping_file in mapping_dir.iterdir():
        mapping = RegionAggregationMapping(mapping_file)
        if mapping.model in region_mapping_dict:
            raise ModelMappingOverlapError()
        region_mapping_dict[mapping.model] = mapping
    return region_mapping_dict

>>> read_all_mappings(Path("mappings/"))
{"model_1": RegionAggregationMapping("model_1"), "model_2": RegionAggregationMapping("model_2")}
However, as we might want to have more functions associated with a collection of RegionAggregationMapping objects, we might want to package that in another, possibly also pydantic, class. Like so:
from pydantic import BaseModel

class RegionAggregationCollection(BaseModel):
    mappings: Dict[str, RegionAggregationMapping]

    @classmethod
    def read_all_mappings(cls, mapping_dir: Union[str, Path]) -> "RegionAggregationCollection":
        ...  # implementation, see read_all_mappings above

    def __getitem__(self, key: str):
        return self.mappings[key]
The __getitem__ method is there to emulate dict behavior so that calls like RegionAggregationCollection["model_1"] are possible. In this class we could then also add other validation functions or other utilities.
My personal preference would be option 2, as it continues the pattern of using pydantic, and all functions related to a group of region-aggregation mappings naturally fit there.
There are some leftover references to Nomenclature, the old name of the DataStructureDefinition class.
From taking a quick look, I've found references in validation.py, conftest.py, test_core.py, test_region_aggregation.py, test_testing.py and test_validation.py.
The current use of <Tag> could be changed to {Tag} to be closer to Python f-string formatting.
Given that there are currently only two or three repositories using tags, this should be a manageable task to implement the change and update the repos simultaneously.
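With {Tag}-style placeholders, expanding a tag could reuse Python's built-in str.format machinery instead of custom <Tag> parsing (a sketch of the idea, not the package implementation; the variable and fuel names are made up):

```python
variable_template = "Primary Energy|{Fuel}"
fuels = ["Coal", "Oil", "Gas"]

# expand the tag into one variable per allowed value
expanded = [variable_template.format(Fuel=fuel) for fuel in fuels]
assert expanded == ["Primary Energy|Coal", "Primary Energy|Oil", "Primary Energy|Gas"]
```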
Thoughts @phackstock?
As a number of projects use the nomenclature package in the form of a GitHub Action for validation of their data structure definitions and their region mappings, it would be nice to add a GitHub Action to this repository.
This GitHub Action can then be used directly in the various projects.
This is what this action might look like:
# This workflow validates that the project is consistent with the nomenclature
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Validate the project

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ '**' ]

jobs:
  project-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.9
        uses: actions/setup-python@v1
        with:
          python-version: 3.9
      - name: Install requirements
        run: pip install -r requirements.txt
      - name: Run the nomenclature project validation
        run: nomenclature validate-project .