equinor / fmu-dataio
FMU data standard and data export with rich metadata in the FMU context
Home Page: https://fmu-dataio.readthedocs.io/en/latest/
License: Apache License 2.0
fmu-dataio/src/fmu/dataio/_export_item.py
Line 409 in b35a317
ValueError: time data '2' does not match format '%Y%m%d'
Input to dataio is e.g.:
timedata=["20150101", "20210101"]
Before:
relative_path: share/results/maps/some.gri
It shall be relative to the case directory, so that aggregations also use the same root:
relative_path: realization-88/iter-2/share/results/maps/some.gri
The argument access_ssdl is valid, but giving it to ExportData has no effect. Default values from global_variables are still used.
Make a workflow for PyPI uploading when tagging a release
The introduction of semeio creates some major obstacles for using fmu-dataio inside RMS 11.* and 12.0. The reason for this is that RMS 12.0 (Python 3.6.1) forces the use of specific old numpy (1.13.3) and pandas (0.21) versions which cannot be replaced. semeio is only applied here for a very minor part of the code, but it forces an upgrade of numpy and pandas when installed (it has a large list of dependencies).
I propose reverting the code; take out semeio until we no longer use RMS 11.* and RMS 12.0.
It would be useful if the tracklog for cases recorded the runs made. Currently, I assume it will simply be overwritten. Not necessarily urgent, but will require that the program first checks for existing case metadata, then simply returns it with an event appended to the tracklog - or creates new if it does not exist.
It might also be worth including some sanity-checks to catch very strange behavior.
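A minimal sketch of the append-or-create behavior, assuming the tracklog is a list of event dicts (field names follow the general metadata layout but are not verified against the schema):

```python
import getpass
from datetime import datetime, timezone

def update_case_tracklog(existing=None):
    """Append a run event to existing case metadata, or create it anew."""
    # A real implementation would also sanity-check that the existing
    # metadata actually describes the same case.
    metadata = existing or {"tracklog": []}
    metadata.setdefault("tracklog", []).append({
        "datetime": datetime.now(timezone.utc).isoformat(),
        "user": {"id": getpass.getuser()},
        "event": "updated" if existing else "created",
    })
    return metadata
```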
Currently, regularsurface is used as class. Used to be surface?
surface seems better; the layout can then be handled with the layout attribute. Logic in various places (clients) is placed on the class attribute, for which the type of surface is not relevant. Having regularsurface as a class implies also having (at least) structuredsurface (or similar) as another class, which for many clients will result in having to combine these two different classes.
Suggest using surface consistently, irrespective of layout and/or format.
Reference implementation for parameters.txt-parsing is in
https://github.com/equinor/semeio/blob/master/semeio/jobs/design_kw/design_kw.py
This is not in line with:
fmu-dataio/src/fmu/dataio/_utils.py
Line 220 in c3111c9
which interprets a nested dictionary structure from the keys based on colons in the key string.
Semeio actively ignores prefixes before colons in keys.
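To make the mismatch concrete, here is a sketch of the two behaviors (simplified, not the actual code in either repo):

```python
def nested_from_colons(key, value, target):
    """Resembles _utils.py: colons in the key build a nested dict."""
    *parents, leaf = key.split(":")
    for part in parents:
        target = target.setdefault(part, {})
    target[leaf] = value

def semeio_style_key(key):
    """Resembles semeio's design_kw: the prefix before the colon is ignored."""
    return key.split(":", 1)[-1]

d = {}
nested_from_colons("INIT_FILES:AQUA_TRANS_FACTOR_NORM", -0.89905, d)
# d == {"INIT_FILES": {"AQUA_TRANS_FACTOR_NORM": -0.89905}}
# semeio_style_key("INIT_FILES:AQUA_TRANS_FACTOR_NORM") == "AQUA_TRANS_FACTOR_NORM"
```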
Calculate file size in bytes and include it in the file tag, next to checksum_md5
fmu-dataio/src/fmu/dataio/dataio.py
Line 246 in 081f586
*reference
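A sketch of how the size could be computed next to the checksum; the field name size_bytes is an assumption:

```python
import hashlib
from pathlib import Path

def file_block(path):
    """Compute the file block with checksum and size in bytes."""
    path = Path(path)
    return {
        "checksum_md5": hashlib.md5(path.read_bytes()).hexdigest(),
        "size_bytes": path.stat().st_size,  # field name is an assumption
    }
```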
Passing input arguments to fmu.dataio.ExportData in ERT forward jobs may be tricky given the number of arguments, and the possible variation between different jobs, assets and users. Suggest defining a section in global_variables, and using that as an (alternative fallback) transport of input arguments.
Pattern 1:
CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
    arg1=x,
    arg2=y,
    arg3=z,
)
Pattern 2:
CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
)
...where global_variables.yml contains:
arg1: x
arg2: y
arg3: z
...in a block that is known to fmu-dataio.
It could be that name could be the key in global_variables; alternatively a separate argument could specify it, e.g.:
CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
    kwargs=CFG["fmu-dataio"][somename],
)
The purpose is to enable fmu-dataio to be used in forward_models without having to pass a long list of (possibly varying) input arguments. Rather, use global_variables.yml for that.
Currently, fmu-dataio is sensitive to the current working directory. It will not work when pwd is the realization root and/or config path or other known contexts when running with ERT.
These are the known contexts that must be supported:
Dataframes are exported with the index included. Not sure what is best practice here, but I think exporting without the index should be either default or an option.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 803 in 9074d45
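For reference, the difference in pandas is just the index flag:

```python
import pandas as pd

df = pd.DataFrame({"DEPTH": [1500.0, 1510.0]})
df.to_csv("with_index.csv")                  # current behaviour: index column included
df.to_csv("without_index.csv", index=False)  # proposed default (or opt-in)
```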
Based on discussions with Sumo and REP. Include a unique ID on the realization and iteration level. The purpose of this is to enable future use cases. Technically, uniqueness can be achieved by any client by combining case.uuid, iteration.id and realization.id, but if this is injected into other database systems, a proper uuid4 will be necessary and beneficial.
case:
  uuid: 36e30888-7dc9-4d91-a525-8bf8e0e67cce  # Generated directly, perhaps more sophisticated in the future
iteration:
  id: 0
  name: "iter-0"
  uuid: 8e5bd5ed-636d-4d38-bcd8-2e538154beda  # Valid uuid4, hash of case.uuid + iteration.id
realization:
  id: 0
  name: "realization-0"
  uuid: 59c69f7d-ce4d-45cb-9da5-f3d13a2fc1b8  # Valid uuid4, hash of case.uuid + iteration.id + realization.id
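A deterministic "hash of" UUID can be derived with uuid.uuid5 (note that this formally yields a version-5 UUID, while the example above says uuid4); the concatenation scheme below is only illustrative:

```python
import uuid

case_uuid = uuid.UUID("36e30888-7dc9-4d91-a525-8bf8e0e67cce")

# Deterministic, reproducible UUIDs derived from the parent identifiers.
iteration_uuid = uuid.uuid5(case_uuid, "0")      # hash of case.uuid + iteration.id
realization_uuid = uuid.uuid5(case_uuid, "0-0")  # + realization.id (separator is illustrative)
```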
Something weird with the behavior of warnings from inside RMS.
warnings.warn(
    "Exporting to a subfolder is a deviation from the standard "
    "and could have consequences for later dependencies",
    UserWarning,
)
...gives this:
UserWarning: Exporting to a subfolder is a deviation from the standard and could have consequences for later dependencies
collected/processed and subsequently both 'independent' and object dependent
Originally posted by @perolavsvendsen in #65 (comment)
Encountered data in a live example that requires grid_model to be included (the only thing separating otherwise identical data objects).
fmu-dataio/src/fmu/dataio/dataio.py
Line 268 in 9074d45
fmu-dataio/src/fmu/dataio/_export_item.py
Line 31 in 081f586
The workflow parameter can be given as anything, and will be inserted directly into the outgoing metadata. However, it shall be an object, and there are also snippets in the code that suggest that an object (dict) is enforced.
String input is expected:
fmu-dataio/src/fmu/dataio/dataio.py
Line 95 in 081f586
Dictionary is made:
fmu-dataio/src/fmu/dataio/dataio.py
Line 245 in 081f586
...but actual behavior is that if a string is given as an argument, the string is carried through fmu-dataio into the outgoing metadata where it is still a string.
Needs some investigation.
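One possible resolution, sketched with a hypothetical helper (the "reference" wrapping key is an assumption, not the current schema):

```python
def _process_workflow(workflow):
    """Normalize the workflow argument so a dict always reaches the metadata."""
    if workflow is None:
        return None
    if isinstance(workflow, str):
        return {"reference": workflow}  # wrapping key is an assumption
    if isinstance(workflow, dict):
        return workflow
    raise ValueError("workflow must be a string or a dictionary")
```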
Also, there is a typo here that should be fixed:
fmu-dataio/src/fmu/dataio/dataio.py
Line 246 in 081f586
Visual settings are included in the metadata schema, but not explicitly set by fmu-dataio. Legacy usage of metadata rests on a lot of visual settings being set while generating data, and in the JS workflow defaults have been included in global_variables. This is a pattern that could possibly be continued into fmu-dataio, combined with the option of giving visual settings as an input argument.
There is a larger discussion here, however, relating to the responsibility of the data producer vs the data consumer. I think it might be a better (?) pattern to let clients decide visual settings as much as possible. That said, visual settings could be the result of domain-specific rules and know-how which could be difficult to convey if not explicitly part of the metadata.
Suggestion, for discussion:
Create a separate VisualSettings object to hold (and calculate) visual settings and pass that to fmu-dataio.ExportData as an argument. If not passed, a skeleton version with default values is included. Experience from JS is that this can quickly escalate, which is why I think keeping it as a separate module is wise. Also, given that patterns might change, I think keeping the interface with the rest of fmu-dataio slim is smart.
revision in outgoing metadata should be the actual revision, not the name of the folder. Folder name can be anything, up to each user.
Currently, parameters are key:value pairs. To prepare for possible future expansions, consider making this into an object, e.g.:
parameters:
  SENSNAME:
    value: VALUE
    description: MyDescription
fluid_contact has additional required information, but passing just the string fluid_contact as content will not fail.
Wanted behavior: when extra information is required, ValidationError should be raised when it is missing.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 55 in a5627f4
Currently, in ExportItem._data_process_content, the validation is only triggered if useextra is defined. It is undefined by default, and is only defined if a dictionary is passed. Hence, if a string is passed, validation will not trigger.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 275 in a5627f4
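A simplified sketch of the wanted behavior, triggering validation for string input as well (names are illustrative, not the actual _export_item.py code; a plain ValueError stands in for ValidationError):

```python
# Simplified registry: None means no extra input is required.
ALLOWED_CONTENTS = {"depth": None, "fluid_contact": {"contact"}}

def validate_content(content):
    """Trigger extra-field validation for both str and dict input."""
    name = content if isinstance(content, str) else next(iter(content))
    required = ALLOWED_CONTENTS.get(name)
    if required:
        given = set() if isinstance(content, str) else set(content[name])
        missing = required - given
        if missing:
            raise ValueError("Content '%s' requires extra fields: %s" % (name, missing))

validate_content("fluid_contact")  # raises instead of silently passing
```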
In the JS workflow, this is referred to as kxh to avoid confusion with Kh/horizontal permeability.
Outgoing metadata for tables gets the string value Nan for the undef field.
Expected behavior: Either a valid number (-999.25 or similar) or nothing.
Tail of stack:
RuntimeError: Unexpected structure of parameters.txt, line is: :::::::INIT_FILES:AQUA_TRANS_FACTOR_NORM -0.89905
It fails on the first line of the file, which looks like this:
INIT_FILES:AQUA_TRANS_FACTOR_NORM -0.89905
This seems to be unrelated to the first line, as it still fails when putting in a dummy entry:
File:
TST 1
Tail of stack:
RuntimeError: Unexpected structure of parameters.txt, line is: :::::::::::::::::::::::::::::::::::::TST::::1
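A sketch of a more tolerant parser that splits on whitespace only and leaves colons in the key intact (illustrative, not the current implementation):

```python
def parse_parameters_line(line):
    """Split on whitespace only, leaving colons in the key untouched."""
    tokens = line.split()
    if len(tokens) != 2:
        raise RuntimeError("Unexpected structure of parameters.txt, line is: " + line)
    key, value = tokens
    try:
        value = float(value)
    except ValueError:
        pass  # keep non-numeric values as strings
    return key, value

parse_parameters_line("INIT_FILES:AQUA_TRANS_FACTOR_NORM -0.89905")
# -> ("INIT_FILES:AQUA_TRANS_FACTOR_NORM", -0.89905)
```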
To be further described. Short story: when using filename tags, they are not always used in the outgoing filename. This may be related to spaces, commas, dots etc. in filenames.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 280 in 081f586
This requires the user to access the code base to find out which elements are in the list. The list of valid contents should be printed when validation fails.
In the frontpage example:
def export_some_surface():
    srf = xtgeo.surface_from_file("top_of_some.gri")
    exp = ExportData(
        config=CFG,
        content="depth",
        unit="m",
        vertical_domain={"depth": "msl"},
        timedata=None,
        is_prediction=True,
        is_observation=False,
        tagname="Some Descr",
        verbosity="WARNING",
    )
    exp.to_file(srf)
the usage of exp.to_file clashes with the similar function pandas.DataFrame.to_csv (and similar functions elsewhere), where the latter takes the filename to write to as its first positional argument, while exp.to_file() takes the data to export as its argument. This might lead to confusion.
Possible solutions:
- Pass the data to the exp object at initialization, or as a separate function call, and potentially fail if data is passed on to to_file().
- Require the data as a keyword argument in to_file(), so that you have to write e.g. exp.to_file(data=srf).
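A sketch of the second option; a keyword-only parameter makes the positional call a TypeError:

```python
class ExportData:
    def to_file(self, *, data):
        """The data argument is keyword-only, so the call is unambiguous."""
        print("exporting", data)

exp = ExportData()
exp.to_file(data="srf")   # OK, intent is explicit
# exp.to_file("srf")      # TypeError: to_file() takes 1 positional argument
```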
Currently, the display_name given in global_variables is not used if it exists.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 688 in b35a317
Expected behavior: if the user puts display_name in the stratigraphic block of global_variables, this should be used.
In case of DESIGN_KW but not GEN_KW, parameters.json will not be generated. Hence it is more failsafe to parse parameters.txt, as parameters.json only represents GEN_KW.
In case of neither GEN_KW nor DESIGN_KW, no parameters.* will be made, which must be possible.
To minimize the footprint in other scripts, suggest that fmu-dataio incorporates the parsing of global_variables.yml. When the run context is known, the (default) location of global_variables should also be known.
Current:
CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
)
Proposed:
exp = ExportData()
Alternatively:
exp = ExportData(
    config_path="path/to/config",
)
Currently, JSON schema definitions + examples are in a separate repo. However, validation is being included in fmu.dataio and some tests involve validation. This means that there is a risk of going out-of-sync.
We could consider merging the two (e.g. putting the JSON schema definitions into this repo). This could become complicated wrt revisions etc., so it must be discussed.
When exporting data with names that include dots (.), the filenames are wrongly defined.
Example:
Horizon name in RMS: MyFormation_1.1
--> Filename: myformation_1.gri
This is problematic for several reasons, but especially when multiple horizons get the same outgoing filename.
Example:
Horizon names MyFormation_1.1, MyFormation_1.2 and MyFormation_1.3 will be exported with the same filename, hence only the last one will be exported. Also, if there is a MyFormation_1 present, this last one will look like it is that one.
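A sketch of a sanitization rule that keeps dotted names distinguishable (the exact mapping is up for discussion):

```python
import re

def sanitize_name(name):
    """Map dots to underscores so dotted names stay unique in filenames."""
    name = name.lower().replace(".", "_")
    return re.sub(r"[^a-z0-9_\-]", "_", name)

sanitize_name("MyFormation_1.1")  # -> "myformation_1_1", no longer collides with 1.2/1.3
```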
Purpose of the issue is to describe dependencies and agree on some standards. I assume some (all?) of these must be baked into this code base.
Prototype referred to: Johan Sverdrup implementation + existing Drogon temporary code.
Ensemble metadata
In prototype: <iter>\share\metadata\fmu_ensemble.yaml on \scratch, on ensemble level.
Example: \scratch\osefax\peesv\mycase\share\iter-0\metadata\fmu_ensemble.yaml
This file is copied into each realization during runtime by an ERT postsim hook workflow, placed under share\runinfo\fmu_ensemble.yaml so that it is contained in each realization (avoiding that a realization goes outside of itself to get data).
In prototype, ensemble metadata is handled by a completely separate script intended to be run through ERT workflows.
Suggestion:
Same as prototype, except suggest to embed ensemble metadata into the same code base as other metadata and use the same core functionality.
Reruns
In prototype, if ensemble metadata already exists, a sanity check is performed (same case name, etc). If the sanity check passes, the existing ensemble metadata is kept, but an item is appended to the runs block.
Suggestion:
Keep the general behavior as in the prototype. When ensemble metadata is first made, establish the metadata and put created as the first item in events. If metadata already exists, perform the sanity check, then overwrite the existing metadata. Keep the existing events but append an updated item. (Suggest to use this behavior for all metadata!)
Location: the ensemble-level share folder. For an iteration, this would be scratch\<asset>\<user>\<case>\<iter>\share\metadata\fmu_ensemble.yaml.
Template metadata
In prototype: fmu_template.yaml on revision root.
Example: \resmod\ff\21.0.0\users\user\21.0.0_mycopy\fmu_template.yaml
This file is copied into the realization during runtime, with the same location.
Suggestion:
Embed into global_variables under a metadata tag.
Read it during export jobs from the already-standard location of global_variables.
Awareness of realization
Each realization must be aware of its own realization id, as this is part of the metadata.
In prototype: the realization is injected into global_variables by an ERT FORWARD_JOB.
Example: fmu_realization: 41
Suggestion:
Continue to place it in global_variables by a FORWARD_JOB, but follow the proposed metadata structure:
{realization: {id: 41}}
Template runs vs FMU runs
Must work both when running the template directly (RMS GUI or other settings) and in FMU runs through ERT. In the current prototype, the following behavior is used: the fmu_ensemble block is produced, and the Realization tag is exported with value null. These are not valid metadata according to standards, but can still be used for debugging.

| realization | ensemble metadata | Behavior | Description |
|---|---|---|---|
| X | | FAIL | Normal run, missing ensemble metadata |
| | X | FAIL | Normal run, missing realization |
| X | X | OK | Normal run |
| | | OK | Template run |
Currently, default file locations are used (share/results/<datatype>). This can get confusing when one datatype is represented by another. E.g. "polygons" are exported to share/results/polygons when an XTGeo polygon object is used, but when multiple polygons are gathered in the same file, they go to share/results/table.
Suggest allowing overrides of where to export, so that a pandas dataframe can also be exported to polygons.
Also, with lots of exports, this can become messy as some local interaction still happens with the exported files. Consider allowing output_subfolder as an input argument.
E.g. if output_subfolder is given, export will happen to default/output_subfolder.
Example:
output_subfolder="MyFolder"
--> Export folder: share/results/<datatype>/myfolder
Use cases for this:
Currently, fmu-dataio is responsible for exporting data + metadata to (local) disk. Other workflows and functions are responsible for uploading those same data to Sumo. A suggestion is to include upload functionality in fmu-dataio. This would have several benefits, such as:
Current situation is the following dependency graph:
[data producing script] -> [disk] -> fmu-dataio -> [disk] -> fmu-sumo
"produce the file" -> "produce metadata" -> "upload"
We want to pull these together as much as possible. For several scripts, particularly within RMS, fmu-dataio is embedded as a dependency in the data-producing script avoiding the extra [disk] layer. When it comes to upload, the natural pattern is to pull fmu-sumo in as a dependency to fmu-dataio.
Comment: Most data-producing scripts in FMU should not have knowledge of Sumo. Knowledge of Sumo should sit within fmu-dataio.
The role of fmu-dataio is to export (and in the future, import) data. It is not specific to "store data on local disk". Hence storing data in Sumo is similar to storing data on local disk (or any other place).
Requirements
Design options and considerations:
Suggest embedding upload functionality into dataio.py and invoking it through arguments in the existing .export method, i.e.:
exp = ExportData(...)
exp.export(obj, sumo=True, local=True)
...where "local" defaults to True
and "sumo" defaults to False
from day 0. Both can be True
.
Further, to avoid excessive network traffic, batch upload is preferred:
E.g. this would be bad:
for obj in my_data_objects:
    exp = ExportData(...)
    exp.export(obj, sumo=True)
...while this would be better:
exp = ExportData(sumo=True) # authentication happens here (once)
for obj in my_data_objects:
    exp.export(obj, sumo=True)
To be further refined!
When exporting a surface with NaNs only, the bbox.zmin and bbox.zmax in the metadata get the illegal value .nan.
Expected behaviour:
Exporting only NaNs feels weird but should be allowed, as there are multiple possible use cases. However, metadata should be valid. Not sure what the best behavior is. Possibly, zmin/zmax should be null in these cases. It's still conceptually tricky.
Reproducing:
# python
from pathlib import Path
from xtgeo import RegularSurface
from fmu.dataio import ExportData
import numpy as np
# make surface with only undefined values
nans = np.empty((33, 50))
nans[:] = np.nan
surf = RegularSurface(
    ncol=33, nrow=50, xori=34522.22, yori=6433231.21,
    xinc=25.0, yinc=25.0, rotation=30,
    values=nans,
)
exp = ExportData(
    content="fluid_contact",
    unit="m",
    vertical_domain={"depth": "msl"},
    timedata=None,
    is_prediction=True,
    is_observation=False,
)
exp._pwd = Path("./tmp/tmp")
exp.to_file(surf)
# Results in...
# bbox:
# zmin: .nan
# zmax: .nan
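A sketch of one possible fix, mapping all-NaN surfaces to null instead of .nan (the function name is hypothetical):

```python
import numpy as np

def bbox_z(values):
    """Return zmin/zmax, or None (serialized as null) for all-NaN surfaces."""
    finite = values[np.isfinite(values)]
    if finite.size == 0:
        return {"zmin": None, "zmax": None}
    return {"zmin": float(finite.min()), "zmax": float(finite.max())}

bbox_z(np.full((33, 50), np.nan))  # -> {"zmin": None, "zmax": None}
```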
The contents of jobs.json are dumped into the metadata. Suggest removing this as it is pure ERT system content. Alternatively, pause it while debugging of outgoing metadata is frequent.
Total number of realizations in a case is not currently submitted, as this can change over the course of a case evolution. E.g. a user might run 100 realizations, then decide to add 100 more. This may be a theoretical problem (for now), but still one that has been taken into account in the metadata definitions.
However, there are requests to include the expected number of realizations to find on a specific case. It would be possible to include the realization count as an input argument to the initializing case methods.
Must develop the parsing of content further so it fulfills various examples found in fmu-metadata definitions:
field_outline: {contact: owc}
fluid_contact: {contact: owc}
seismic: {attribute: xx, zrange: 2, filter_size: 1.0, scaling_factor: 3, is_timelapse: True, acquision_start_date: 2021-11-01, acquision_end_date: 2021-12-03, acquision_reference_date: 2021-11-20, offset: 0-15}
The field stratigraphic must be set to False by default.
When installing in a live example, the need for handling aggregated data quickly becomes evident. This is slightly different compared to realizations:
- fmu.realization replaced by fmu.aggregation, which would need input
- element_id must be included
The current method is to use an existing realization as a template for creating aggregated metadata.
Consider this a placeholder. Needs to be refined and properly described.
If a user initializes a case with one iteration, the case metadata is made. If the user then runs another iteration, and the case is re-initialized, the case metadata is overwritten and a new uuid is made.
If case metadata already exists, the case.uuid must be kept as-is.
In the prototype, existing metadata was parsed, some sanity checks were done, and the fmu.case.uuid was kept before overwriting the metadata using the same uuid. A simpler approach for now could be to not overwrite case metadata if it already exists.
fmu-dataio/src/fmu/dataio/_export_item.py
Line 31 in 081f586
fmu-dataio/src/fmu/dataio/_export_item.py
Line 365 in 73965d2
poly.pname is "POLY_ID". This is not a valid column name in poly.dataframe, hence the function fails with
KeyError: 'POLY_ID'
Possibly inherited from XTgeo?
Suggest adding regions to the valid list of contents, for exporting e.g. volume regions and other regions.
Suggest implementing with no associated extra input for now.
fmu-dataio: knowledge of the pa.Table format, similar to the current handling of pd.DataFrame.
exp = ExportData()
exp.to_disk(pa.Table)
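A minimal sketch of what such support could look like, assuming Feather/Arrow IPC as the on-disk format (the function name and format choice are assumptions, not the actual API):

```python
import pyarrow as pa
import pyarrow.feather as feather

def export_table(obj, path):
    """Dispatch on pa.Table, analogous to the pd.DataFrame handling."""
    if isinstance(obj, pa.Table):
        feather.write_feather(obj, path)
    else:
        raise TypeError("Unsupported object type: %s" % type(obj))

table = pa.table({"PORO": [0.15, 0.25], "PERM": [100.0, 450.0]})
export_table(table, "mytable.arrow")
```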
Purpose of issue is to describe minimum viable functionality.
Suggestions:
The dollar sign should be used for the $schema tag in outgoing metadata.