equinor / fmu-dataio
FMU data standard and data export with rich metadata in the FMU context
Home Page: https://fmu-dataio.readthedocs.io/en/latest/
License: Apache License 2.0
fmu-dataio/src/fmu/dataio/_export_item.py
Line 409 in b35a317
ValueError: time data '2' does not match format '%Y%m%d'
Input to dataio is e.g.:
timedata=["20150101", "20210101"]
Before:
relative_path: share/results/maps/some.gri
It shall be relative to the case directory, so that aggregations also use the same root:
relative_path: realization-88/iter-2/share/results/maps/some.gri
The argument access_ssdl is valid, but giving it to ExportData has no effect. Default values from global_variables are still used.
Make a workflow for PyPI uploading when tagging a release
The introduction of semeio creates some major obstacles for using fmu-dataio inside RMS 11.* and 12.0. The reason for this is that RMS 12.0 (Python 3.6.1) forces the use of specific old numpy (1.13.3) and pandas (0.21) versions which cannot be replaced. semeio is only applied here for a very minor part of the code, but it forces an upgrade of numpy and pandas when installed (it has a large list of dependencies).
I propose reverting the code; take out semeio until we no longer use RMS 11.* and RMS 12.0.
It would be useful if the tracklog for cases recorded the runs made. Currently, I assume it will simply be overwritten. Not necessarily urgent, but will require that the program first checks for existing case metadata, then simply returns it with an event appended to the tracklog - or creates new if it does not exist.
It might also be worth including some sanity-checks to catch very strange behavior.
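A minimal sketch of the append-or-create behavior, assuming the tracklog is a list of event dicts (field names follow the general metadata layout but are not verified against the schema):

```python
import getpass
from datetime import datetime, timezone

def update_case_tracklog(existing=None):
    """Append a run event to existing case metadata, or create it anew."""
    # A real implementation would also sanity-check that the existing
    # metadata actually describes the same case.
    metadata = existing or {"tracklog": []}
    metadata.setdefault("tracklog", []).append({
        "datetime": datetime.now(timezone.utc).isoformat(),
        "user": {"id": getpass.getuser()},
        "event": "updated" if existing else "created",
    })
    return metadata
```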
Currently, regularsurface is used as class. Used to be surface?
surface seems better; the layout can then be handled with the layout attribute. Logic in various places (clients) is placed on the class attribute, for which the type of surface is not relevant. Having regularsurface as a class implies also having (at least) structuredsurface (or similar) as another class, which for many clients will result in having to combine these two different classes.
Suggest using surface consistently, irrespective of layout and/or format.
Reference implementation for parameters.txt-parsing is in
https://github.com/equinor/semeio/blob/master/semeio/jobs/design_kw/design_kw.py
This is not in line with:
fmu-dataio/src/fmu/dataio/_utils.py
Line 220 in c3111c9
which interprets a nested dictionary structure from the keys based on colons in the key string.
Semeio actively ignores prefixes before colons in keys.
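To make the mismatch concrete, here is a sketch of the two behaviors (simplified, not the actual code in either repo):

```python
def nested_from_colons(key, value, target):
    """Resembles _utils.py: colons in the key build a nested dict."""
    *parents, leaf = key.split(":")
    for part in parents:
        target = target.setdefault(part, {})
    target[leaf] = value

def semeio_style_key(key):
    """Resembles semeio's design_kw: the prefix before the colon is ignored."""
    return key.split(":", 1)[-1]

d = {}
nested_from_colons("INIT_FILES:AQUA_TRANS_FACTOR_NORM", -0.89905, d)
# d == {"INIT_FILES": {"AQUA_TRANS_FACTOR_NORM": -0.89905}}
# semeio_style_key("INIT_FILES:AQUA_TRANS_FACTOR_NORM") == "AQUA_TRANS_FACTOR_NORM"
```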
Calculate file size in bytes and include it in the file tag, next to checksum_md5
fmu-dataio/src/fmu/dataio/dataio.py
Line 246 in 081f586
*reference
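A sketch of how the size could be computed next to the checksum; the field name size_bytes is an assumption:

```python
import hashlib
from pathlib import Path

def file_block(path):
    """Compute the file block with checksum and size in bytes."""
    path = Path(path)
    return {
        "checksum_md5": hashlib.md5(path.read_bytes()).hexdigest(),
        "size_bytes": path.stat().st_size,  # field name is an assumption
    }
```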
Passing input arguments to fmu.dataio.ExportData in ERT forward jobs may be tricky given the number of arguments, and the possible variation between different jobs, assets and users. Suggest defining a section in global_variables, and using that as an (alternative fallback) transport of input arguments.
Pattern 1:
CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
    arg1=x,
    arg2=y,
    arg3=z,
)
Pattern 2:
CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
)
...where global_variables.yml contains:
arg1: x
arg2: y
arg3: z
...in a block that is known to fmu-dataio.
It could be that name could be the key in global_variables; alternatively a separate argument could specify it, e.g.:
CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
    kwargs=CFG["fmu-dataio"][somename],
)
The purpose is to enable fmu-dataio to be used in forward_models without having to pass a long list of (possibly varying) input arguments. Rather, use global_variables.yml for that.
Currently, fmu-dataio is sensitive to the current working directory. It will not work when pwd is the realization root and/or config path or other known contexts when running with ERT.
These are the known contexts that must be supported:
Dataframes are exported with the index included. Not sure what is best practice here, but I think exporting without the index should be either default or an option.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 803 in 9074d45
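For reference, the difference in pandas is just the index flag:

```python
import pandas as pd

df = pd.DataFrame({"DEPTH": [1500.0, 1510.0]})
df.to_csv("with_index.csv")                  # current behaviour: index column included
df.to_csv("without_index.csv", index=False)  # proposed default (or opt-in)
```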
Based on discussions with Sumo and REP. Include a unique ID on the realization and iteration level. The purpose of this is to enable future use cases. Technically, uniqueness can be achieved by any client by combining case.uuid, iteration.id and realization.id, but if this is injected into other database systems, a proper uuid4 will be necessary and beneficial.
case:
  uuid: 36e30888-7dc9-4d91-a525-8bf8e0e67cce  # Generated directly, perhaps more sophisticated in the future
iteration:
  id: 0
  name: "iter-0"
  uuid: 8e5bd5ed-636d-4d38-bcd8-2e538154beda  # Valid uuid4, hash of case.uuid + iteration.id
realization:
  id: 0
  name: "realization-0"
  uuid: 59c69f7d-ce4d-45cb-9da5-f3d13a2fc1b8  # Valid uuid4, hash of case.uuid + iteration.id + realization.id
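A deterministic "hash of" UUID can be derived with uuid.uuid5 (note that this formally yields a version-5 UUID, while the example above says uuid4); the concatenation scheme below is only illustrative:

```python
import uuid

case_uuid = uuid.UUID("36e30888-7dc9-4d91-a525-8bf8e0e67cce")

# Deterministic, reproducible UUIDs derived from the parent identifiers.
iteration_uuid = uuid.uuid5(case_uuid, "0")      # hash of case.uuid + iteration.id
realization_uuid = uuid.uuid5(case_uuid, "0-0")  # + realization.id (separator is illustrative)
```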
Something weird with the behavior of warnings from inside RMS.
warnings.warn(
    "Exporting to a subfolder is a deviation from the standard "
    "and could have consequences for later dependencies",
    UserWarning,
)
...gives this:
UserWarning: Exporting to a subfolder is a deviation from the standard and could have consequences for later dependencies
collected/processed and subsequently both 'independent' and object dependent
Originally posted by @perolavsvendsen in #65 (comment)
Encountered data in a live example that requires grid_model to be included (the only thing separating otherwise identical data objects).
fmu-dataio/src/fmu/dataio/dataio.py
Line 268 in 9074d45
fmu-dataio/src/fmu/dataio/_export_item.py
Line 31 in 081f586
The workflow parameter can be given as anything, and will be inserted directly into the outgoing metadata. However, it shall be an object, and there are also snippets in the code that suggest that an object (dict) is enforced.
String input is expected:
fmu-dataio/src/fmu/dataio/dataio.py
Line 95 in 081f586
Dictionary is made:
fmu-dataio/src/fmu/dataio/dataio.py
Line 245 in 081f586
...but actual behavior is that if a string is given as an argument, the string is carried through fmu-dataio into the outgoing metadata where it is still a string.
Needs some investigation.
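One possible resolution, sketched with a hypothetical helper (the "reference" wrapping key is an assumption, not the current schema):

```python
def _process_workflow(workflow):
    """Normalize the workflow argument so a dict always reaches the metadata."""
    if workflow is None:
        return None
    if isinstance(workflow, str):
        return {"reference": workflow}  # wrapping key is an assumption
    if isinstance(workflow, dict):
        return workflow
    raise ValueError("workflow must be a string or a dictionary")
```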
Also, there is a typo here that should be fixed:
fmu-dataio/src/fmu/dataio/dataio.py
Line 246 in 081f586
Visual settings are included in the metadata schema, but not explicitly set by fmu-dataio. Legacy usage of metadata rests on a lot of visual settings being set while generating data, and in the JS workflow defaults have been included in global_variables. This is a pattern that could possibly be continued into fmu-dataio, combined with the option of giving visual settings as an input argument.
There is a larger discussion here, however, relating to the responsibility of the data producer vs the data consumer. I think it might be a better (?) pattern to let clients decide visual settings as much as possible. That said, visual settings could be the result of domain-specific rules and know-how which could be difficult to convey if not explicitly part of the metadata.
Suggestion, for discussion:
Create a separate VisualSettings object to hold (and calculate) visual settings and pass that to fmu-dataio.ExportData as an argument. If not passed, a skeleton version with default values is included. Experience from JS is that this can quickly escalate, which is why I think keeping it as a separate module is wise. Also, given that patterns might change, I think keeping the interface with the rest of fmu-dataio slim is smart.
revision in outgoing metadata should be the actual revision, not the name of the folder. Folder name can be anything, up to each user.
Currently, parameters are key:value pairs. To prepare for possible future expansions, consider making this into an object, e.g.:
parameters:
  SENSNAME:
    value: VALUE
    description: MyDescription
fluid_contact has additional required information, but passing just the string fluid_contact as content will not fail.
Wanted behavior: when extra information is required, ValidationError should be raised when it is missing.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 55 in a5627f4
Currently, in ExportItem._data_process_content, the validation is only triggered if useextra is defined. It is undefined by default, and is only defined if a dictionary is passed. Hence, if a string is passed, validation will not trigger.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 275 in a5627f4
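A simplified sketch of the wanted behavior, triggering validation for string input as well (names are illustrative, not the actual _export_item.py code; a plain ValueError stands in for ValidationError):

```python
# Simplified registry: None means no extra input is required.
ALLOWED_CONTENTS = {"depth": None, "fluid_contact": {"contact"}}

def validate_content(content):
    """Trigger extra-field validation for both str and dict input."""
    name = content if isinstance(content, str) else next(iter(content))
    required = ALLOWED_CONTENTS.get(name)
    if required:
        given = set() if isinstance(content, str) else set(content[name])
        missing = required - given
        if missing:
            raise ValueError("Content '%s' requires extra fields: %s" % (name, missing))

validate_content("fluid_contact")  # raises instead of silently passing
```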
In the JS workflow, this is referred to as kxh to avoid confusion with Kh/horizontal permeability.
Outgoing metadata for tables gets the string value Nan for the undef field.
Expected behavior: Either a valid number (-999.25 or similar) or nothing.
Tail of stack:
RuntimeError: Unexpected structure of parameters.txt, line is: :::::::INIT_FILES:AQUA_TRANS_FACTOR_NORM -0.89905
It fails on the first line of the file, which looks like this:
INIT_FILES:AQUA_TRANS_FACTOR_NORM -0.89905
This seems to be unrelated to the first line, as it still fails when putting in a dummy entry:
File:
TST 1
Tail of stack:
RuntimeError: Unexpected structure of parameters.txt, line is: :::::::::::::::::::::::::::::::::::::TST::::1
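A sketch of a more tolerant parser that splits on whitespace only and leaves colons in the key intact (illustrative, not the current implementation):

```python
def parse_parameters_line(line):
    """Split on whitespace only, leaving colons in the key untouched."""
    tokens = line.split()
    if len(tokens) != 2:
        raise RuntimeError("Unexpected structure of parameters.txt, line is: " + line)
    key, value = tokens
    try:
        value = float(value)
    except ValueError:
        pass  # keep non-numeric values as strings
    return key, value

parse_parameters_line("INIT_FILES:AQUA_TRANS_FACTOR_NORM -0.89905")
# -> ("INIT_FILES:AQUA_TRANS_FACTOR_NORM", -0.89905)
```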
To be further described. Short story: when using filename tags, they are not always used in the outgoing filename. This may be related to spaces, commas, dots etc. in filenames.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 280 in 081f586
This requires the user to access the code base to find out which elements are in the list. The list of valid contents should be printed when validation fails.
In the frontpage example:
def export_some_surface():
    srf = xtgeo.surface_from_file("top_of_some.gri")
    exp = ExportData(
        config=CFG,
        content="depth",
        unit="m",
        vertical_domain={"depth": "msl"},
        timedata=None,
        is_prediction=True,
        is_observation=False,
        tagname="Some Descr",
        verbosity="WARNING",
    )
    exp.to_file(srf)
the usage of exp.to_file clashes with the similar function pandas.DataFrame.to_csv (and similar functions elsewhere), where the latter takes the filename to write to as its first positional argument, while exp.to_file() takes the data to export as its argument. This might lead to confusion.
Possible solutions:
- Pass the data to the exp object at initialization, or as a separate function call, and potentially fail if data is passed on to to_file().
- Require the data as a keyword argument in to_file(), so that you have to write e.g. exp.to_file(data=srf).
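A sketch of the second option; a keyword-only parameter makes the positional call a TypeError:

```python
class ExportData:
    def to_file(self, *, data):
        """The data argument is keyword-only, so the call is unambiguous."""
        print("exporting", data)

exp = ExportData()
exp.to_file(data="srf")   # OK, intent is explicit
# exp.to_file("srf")      # TypeError: to_file() takes 1 positional argument
```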
Currently, the display_name given in global_variables is not used if it exists.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 688 in b35a317
Expected behavior: if the user puts display_name in the stratigraphic block of global_variables, this should be used.
In case of DESIGN_KW but not GEN_KW, parameters.json will not be generated. Hence it is more failsafe to parse parameters.txt, as parameters.json only represents GEN_KW.
In case of neither GEN_KW nor DESIGN_KW, no parameters.* will be made, which must be possible.
To minimize the footprint in other scripts, suggest that fmu-dataio incorporates the parsing of global_variables.yml. When the run context is known, the (default) location of global_variables should also be known.
Current:
CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
)
Proposed:
exp = ExportData()
Alternatively:
exp = ExportData(
    config_path="path/to/config",
)
Currently, JSON schema definitions + examples are in a separate repo. However, validation is being included in fmu.dataio and some tests involve validation. This means that there is a risk of going out-of-sync.
We could consider merging the two (e.g. putting the JSON schema definitions into this repo). This could become complicated wrt revisions etc., so it must be discussed.
When exporting data with names that include dots (.), the filenames are wrongly defined.
Example:
Horizon name in RMS: MyFormation_1.1
--> Filename: myformation_1.gri
This is problematic for several reasons, but especially when multiple horizons get the same outgoing filename.
Example:
Horizon names MyFormation_1.1, MyFormation_1.2 and MyFormation_1.3 will be exported with the same filename, hence only the last one will be exported. Also, if there is a MyFormation_1 present, this last one will look like it is that one.
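A sketch of a sanitization rule that keeps dotted names distinguishable (the exact mapping is up for discussion):

```python
import re

def sanitize_name(name):
    """Map dots to underscores so dotted names stay unique in filenames."""
    name = name.lower().replace(".", "_")
    return re.sub(r"[^a-z0-9_\-]", "_", name)

sanitize_name("MyFormation_1.1")  # -> "myformation_1_1", no longer collides with 1.2/1.3
```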
Purpose of the issue is to describe dependencies and agree on some standards. I assume some (all?) of these must be baked into this code base.
Prototype referred to: Johan Sverdrup implementation + existing Drogon temporary code.
Ensemble metadata
In prototype: <iter>\share\metadata\fmu_ensemble.yaml on \scratch, on ensemble level.
Example: \scratch\osefax\peesv\mycase\share\iter-0\metadata\fmu_ensemble.yaml
This file is copied into each realization during runtime by an ERT postsim hook workflow, placed under share\runinfo\fmu_ensemble.yaml so that it is contained in each realization (avoiding that a realization goes outside of itself to get data).
In prototype, ensemble metadata is handled by a completely separate script intended to be run through ERT workflows.
Suggestion:
Same as prototype, except suggest to embed ensemble metadata into the same code base as other metadata and use the same core functionality.
Reruns
In prototype, if ensemble metadata already exists, a sanity check is performed (same case name, etc). If the sanity check passes, the existing ensemble metadata is kept, but an item is appended to the runs block.
Suggestion:
Keep the general behavior as in the prototype. When ensemble metadata is first made, establish the metadata and put created as the first item in events. If metadata already exists, perform the sanity check, then overwrite the existing metadata. Keep the existing events but append an updated item. (Suggest to use this behavior for all metadata!)
Location: the ensemble-level share folder. For an iteration, this would be scratch\<asset>\<user>\<case>\<iter>\share\metadata\fmu_ensemble.yaml.
Template metadata
In prototype: fmu_template.yaml on revision root.
Example: \resmod\ff\21.0.0\users\user\21.0.0_mycopy\fmu_template.yaml
This file is copied into the realization during runtime, with the same location.
Suggestion:
Embed into global_variables under a metadata tag.
Read it during export jobs from the already-standard location of global_variables.
Awareness of realization
Each realization must be aware of its own realization id, as this is part of the metadata.
In prototype: the realization is injected into global_variables by an ERT FORWARD_JOB.
Example: fmu_realization: 41
Suggestion:
Continue to place it in global_variables by a FORWARD_JOB, but follow the proposed metadata structure:
{realization: {id: 41}}
Template runs vs FMU runs
Must work both when running the template directly (RMS GUI or other settings) and in FMU runs through ERT. In the current prototype, the following behavior is used: the fmu_ensemble block is produced, and the Realization tag is exported with value null. These are not valid metadata according to standards, but can still be used for debugging.

| realization | ensemble metadata | Behavior | Description |
|---|---|---|---|
| X | | FAIL | Normal run, missing ensemble metadata |
| | X | FAIL | Normal run, missing realization |
| X | X | OK | Normal run |
| | | OK | Template run |
Currently, default file locations are used (share/results/<datatype>). This can get confusing when one datatype is represented by another. E.g. "polygons" are exported to share/results/polygons when an XTGeo polygon object is used, but when multiple polygons are gathered in the same file, they go to share/results/table.
Suggest allowing overrides of where to export, so that a pandas dataframe can also be exported to polygons.
Also, with lots of exports, this can become messy as some local interaction still happens with the exported files. Consider allowing output_subfolder as an input argument.
E.g. if output_subfolder is given, export will happen to default/output_subfolder.
Example:
output_subfolder="MyFolder"
--> Export folder: share/results/<datatype>/myfolder
Use cases for this:
Currently, fmu-dataio is responsible for exporting data + metadata to (local) disk. Other workflows and functions are responsible for uploading those same data to Sumo. A suggestion is to include upload functionality in fmu-dataio. This would have several benefits, such as:
Current situation is the following dependency graph:
[data producing script] -> [disk] -> fmu-dataio -> [disk] -> fmu-sumo
"produce the file" -> "produce metadata" -> "upload"
We want to pull these together as much as possible. For several scripts, particularly within RMS, fmu-dataio is embedded as a dependency in the data-producing script avoiding the extra [disk] layer. When it comes to upload, the natural pattern is to pull fmu-sumo in as a dependency to fmu-dataio.
Comment: Most data-producing scripts in FMU should not have knowledge of Sumo. Knowledge of Sumo should sit within fmu-dataio.
The role of fmu-dataio is to export (and in the future, import) data. It is not specific to "store data on local disk". Hence storing data in Sumo is similar to storing data on local disk (or any other place).
Requirements
Design options and considerations:
Suggest embedding upload functionality into dataio.py and invoking it through arguments in the existing .export method, i.e.:
exp = ExportData(...)
exp.export(obj, sumo=True, local=True)
...where "local" defaults to True
and "sumo" defaults to False
from day 0. Both can be True
.
Further, to avoid excessive network traffic, batch upload is preferred:
E.g. this would be bad:
for obj in my_data_objects:
    exp = ExportData(...)
    exp.export(obj, sumo=True)
...while this would be better:
exp = ExportData(sumo=True) # authentication happens here (once)
for obj in my_data_objects:
    exp.export(obj, sumo=True)
To be further refined!
When exporting a surface with NaNs only, the bbox.zmin and bbox.zmax in the metadata get the illegal value .nan.
Expected behaviour:
Exporting only NaNs feels weird but should be allowed, as there are multiple possible use cases. However, metadata should be valid. Not sure what the best behavior is. Possibly, zmin/zmax should be null in these cases. It's still conceptually tricky.
Reproducing:
# python
from pathlib import Path
from xtgeo import RegularSurface
from fmu.dataio import ExportData
import numpy as np
# make surface with only undefined values
nans = np.empty((33, 50))
nans[:] = np.nan
surf = RegularSurface(
    ncol=33, nrow=50, xori=34522.22, yori=6433231.21,
    xinc=25.0, yinc=25.0, rotation=30,
    values=nans,
)
exp = ExportData(
    content="fluid_contact",
    unit="m",
    vertical_domain={"depth": "msl"},
    timedata=None,
    is_prediction=True,
    is_observation=False,
)
exp._pwd = Path("./tmp/tmp")
exp.to_file(surf)
# Results in...
# bbox:
# zmin: .nan
# zmax: .nan
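A sketch of one possible fix, mapping all-NaN surfaces to null instead of .nan (the function name is hypothetical):

```python
import numpy as np

def bbox_z(values):
    """Return zmin/zmax, or None (serialized as null) for all-NaN surfaces."""
    finite = values[np.isfinite(values)]
    if finite.size == 0:
        return {"zmin": None, "zmax": None}
    return {"zmin": float(finite.min()), "zmax": float(finite.max())}

bbox_z(np.full((33, 50), np.nan))  # -> {"zmin": None, "zmax": None}
```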
The contents of jobs.json are dumped into the metadata. Suggest removing this as it is pure ERT system content. Alternatively, pause it while debugging of outgoing metadata is frequent.
Total number of realizations in a case is not currently submitted, as this can change over the course of a case evolution. E.g. a user might run 100 realizations, then decide to add 100 more. This may be a theoretical problem (for now), but still one that has been taken into account in the metadata definitions.
However, there are requests to include the expected number of realizations to find on a specific case. It would be possible to include the realization count as an input argument to the initializing case methods.
Must develop the parsing of content further so it fulfills various examples found in fmu-metadata definitions:
field_outline: {contact: owc}
fluid_contact: {contact: owc}
seismic: {attribute: xx, zrange: 2, filter_size: 1.0, scaling_factor: 3, is_timelapse: True, acquision_start_date: 2021-11-01, acquision_end_date: 2021-12-03, acquision_reference_date: 2021-11-20, offset: 0-15}
The field stratigraphic must be set to False by default.
When installing in a live example, the need for handling aggregated data quickly becomes evident. This is slightly different compared to realizations:
- fmu.realization replaced by fmu.aggregation, which would need input
- element_id must be included
The current method is to use an existing realization as a template for creating aggregated metadata.
Consider this a placeholder. Needs to be refined and properly described.
If a user initializes a case with one iteration, the case metadata is made. If the user then runs another iteration, and the case is re-initialized, the case metadata is overwritten and a new uuid is made.
If case metadata already exists, the case.uuid must be kept as-is.
In the prototype, existing metadata was parsed, some sanity checks were done, and the fmu.case.uuid was kept before overwriting the metadata using the same uuid. A simpler approach for now could be to not overwrite case metadata if it already exists.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 31 in 081f586
fmu-dataio/src/fmu/dataio/_export_item.py
Line 365 in 73965d2
poly.pname is "POLY_ID". This is not a valid column name in poly.dataframe, hence the function fails with
KeyError: 'POLY_ID'
Possibly inherited from XTgeo?
Suggest adding regions to the valid list of contents, for exporting e.g. volume regions and other regions.
Suggest implementing with no associated extra input for now.
fmu-dataio: knowledge of the pa.Table format, similar to the current handling of pd.DataFrame.
exp = ExportData()
exp.to_disk(pa.Table)
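A minimal sketch of what such support could look like, assuming Feather/Arrow IPC as the on-disk format (the function name and format choice are assumptions, not the actual API):

```python
import pyarrow as pa
import pyarrow.feather as feather

def export_table(obj, path):
    """Dispatch on pa.Table, analogous to the pd.DataFrame handling."""
    if isinstance(obj, pa.Table):
        feather.write_feather(obj, path)
    else:
        raise TypeError("Unsupported object type: %s" % type(obj))

table = pa.table({"PORO": [0.15, 0.25], "PERM": [100.0, 450.0]})
export_table(table, "mytable.arrow")
```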
Purpose of issue is to describe minimum viable functionality.
Suggestions:
The dollar sign should be used for the $schema tag in outgoing metadata.