
fmu-dataio's Introduction

fmu-dataio


fmu-dataio is a library for handling data flow in and out of Fast Model Update workflows. For export, it automates adherence to the FMU data standard ✅ including both file and folder conventions as well as richer metadata 🔖 for use by various data consumers both inside and outside the FMU context via Sumo.

fmu-dataio is designed to be used with the same syntax in all parts of an FMU workflow, including post- and pre-processing jobs and as part of ERT FORWARD_MODEL, both inside and outside RMS.

👉 Detailed documentation for fmu-dataio at Read the Docs. 👀

fmu-dataio is also showcased in Drogon. 💪

Data standard definitions


The metadata standard is defined by a JSON schema. Within Equinor, the schema is available on a Radix-hosted endpoint ⚡

fmu-dataio's People

Contributors

adnejacobsen, alifbe, berland, daniel-sol, janbjorge, jcrivenaes, maninfez, mferrera, perolavsvendsen, rwiker, sago64, tnatt


fmu-dataio's Issues

Add regions to valid contents

Suggest adding regions to the valid list of contents, for exporting e.g. volume regions and other region data.

Suggest implementing this with no associated extra input for now.

Tracklog for cases to record runs

It would be useful if the tracklog for cases recorded the runs that are made. Currently, I assume it will simply be overwritten. Not necessarily urgent, but it will require that the program first checks for existing case metadata and, if found, returns it with an event appended to the tracklog - or creates new metadata if it does not exist. A sketch of this is shown below.

It might also be worth including some sanity-checks to catch very strange behavior.
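
A minimal sketch of the intended check-then-append flow, assuming case metadata lives in a YAML file and the tracklog is a list of event entries (file location and helper name are illustrative only):

import datetime
from pathlib import Path

import yaml

def update_case_tracklog(case_metadata_file: Path, event: str) -> dict:
    """Append an event to existing case metadata, or create new metadata."""
    now = datetime.datetime.now().isoformat()
    if case_metadata_file.exists():
        # Existing case metadata: keep it, only append an event to the tracklog
        metadata = yaml.safe_load(case_metadata_file.read_text())
        metadata.setdefault("tracklog", []).append({"datetime": now, "event": event})
    else:
        # No existing metadata: create new, with 'created' as the first event
        metadata = {"tracklog": [{"datetime": now, "event": "created"}]}
    case_metadata_file.write_text(yaml.safe_dump(metadata))
    return metadata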

Support the pyarrow format

fmu-dataio needs knowledge of the pa.Table format, similar to the current handling of pd.DataFrame.

Wanted end-state

  • From an ERT perspective: one FORWARD_JOB that extracts SMRY data, exports via fmu-dataio and uploads to Sumo.
  • Starting point (from fmu-dataio perspective): pa.Table object
  • fmu-dataio exports to disk "as usual".

Tasks

  • Implement support for pa.Table as a known table object, with Arrow as the output format.

Requirements

  • Minimize footprint in ERT workflow for end user. E.g. rather than one FORWARD_MODEL for each step (data out, metadata on, upload), consolidate as much as possible.
  • fmu-dataio to be sourced in smry2arrow.py

Schematic example

# Schematic only; the exact method name and signature are to be decided
exp = ExportData()
exp.to_disk(table)  # 'table' is a pa.Table instance

case.uuid must be persisted through many runs

If a user initializes a case with one iteration, the case metadata is made. If the user then runs another iteration, and the case is re-initialized, the case metadata is overwritten and a new uuid is made.

If case metadata already exists, the case.uuid must be kept as-is.

In the prototype, existing metadata was parsed, some sanity checks were done, and the fmu.case.uuid was kept when the metadata was overwritten with the same uuid. A simpler approach for now could be to not overwrite case metadata if it already exists; see the sketch below.
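
A minimal sketch of the simpler approach, assuming case metadata is a YAML file carrying fmu.case.uuid (path handling and helper name are illustrative only):

import uuid
from pathlib import Path

import yaml

def get_or_create_case_uuid(case_metadata_file: Path) -> str:
    """Reuse the uuid from existing case metadata, or create a new one."""
    if case_metadata_file.exists():
        # Case metadata already exists: do not overwrite, keep the uuid as-is
        existing = yaml.safe_load(case_metadata_file.read_text())
        return existing["fmu"]["case"]["uuid"]
    # First initialization of the case: generate a fresh uuid
    return str(uuid.uuid4())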

Warnings give additional output

Something weird with the behavior of warnings from inside RMS.

warnings.warn(
    "Exporting to a subfolder is a deviation from the standard "
    "and could have consequences for later dependencies",
    UserWarning,
)

...gives this:

UserWarning: Exporting to a subfolder is a deviation from the standard and could have consequences for later dependencies
  collected/processed and subsequently both 'independent' and object dependent

Originally posted by @perolavsvendsen in #65 (comment)

Get input arguments from global_variables.yml

Passing input arguments to fmu.dataio.ExportData in ERT forward jobs may be tricky given the number of arguments and the possible variation between different jobs, assets and users. Suggest defining a section in global_variables and using that as an (alternative/fallback) transporter of input arguments.

Pattern 1:

CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
    arg1=x,
    arg2=y,
    arg3=z,
)

Pattern 2:

CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
)

...where global_variables.yml contains:

arg1: x
arg2: y
arg3: z

...in a block that is known to fmu-dataio.

The name could be the key in global_variables; alternatively, a separate argument could specify it, e.g.:

CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG,
    kwargs=CFG["fmu-dataio"][somename],
)

The purpose is to enable fmu-dataio to be used in forward_models without having to pass a long list of (possibly varying) input arguments. Rather use global_variables.yml for that.

Revision name

revision in outgoing metadata should be the actual revision, not the name of the folder. The folder name can be anything and is up to each user.

semeio creates dependencies that are incompatible with RMS 12.0 and earlier

The introduction of semeio creates some major obstacles for using fmu-dataio inside RMS 11.* and 12.0. The reason is that RMS 12.0 (Python 3.6.1) forces the use of specific old numpy (1.13.3) and pandas (0.21) versions which cannot be replaced. semeio is only applied here for a very minor part of the code, but it forces an upgrade of numpy and pandas when installed (it has a large list of dependencies).

I propose reverting the code and taking out semeio until we no longer use RMS 11.* and RMS 12.0.

When there are dots in the name, filenames are wrongly defined

When exporting data with names that include dots (.), the filenames are wrongly defined.

Example:
Horizon name in RMS: MyFormation_1.1 --> Filename: myformation_1.gri

This is problematic for several reasons, but especially when multiple horizons get the same outgoing filename.

Example:
Horizon names MyFormation_1.1, MyFormation_1.2 and MyFormation_1.3 will be exported with the same filename, hence only the last one will actually be exported. Also, if a MyFormation_1 is present, the exported file will appear to belong to that horizon instead.

Parsing of "content"

The parsing of content must be developed further so that it fulfills the various examples found in the fmu-metadata definitions (a sketch of one possible call pattern follows the list):

  • field_outline: {contact: owc}
  • fluid_contact: {contact: owc}
  • seismic: {attribute: xx, zrange: 2, filter_size: 1.0, scaling_factor: 3, is_timelapse: True, acquisition_start_date: 2021-11-01, acquisition_end_date: 2021-12-03, acquisition_reference_date: 2021-11-20, offset: 0-15}
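
A hedged sketch of how such content could be passed to ExportData, assuming the extra information is given as a nested dictionary under the content key (the exact argument layout is what this issue should settle):

exp = ExportData(
    content={
        "seismic": {
            "attribute": "amplitude",
            "zrange": 2.0,
            "filter_size": 1.0,
            "scaling_factor": 3.0,
        }
    },
    # ... other arguments as usual
)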

Minimum viable functionality

Purpose of issue is to describe minimum viable functionality.

Suggestions:

  • Able to create/update ensemble metadata according to proposed standard
  • Able to export data according to the proposed standard
  • Work both inside RMS and outside.
    • Export data to meet current REP requirements
      • Regular surfaces
      • Polygons
      • (Able to scale to Tables, Seismic (Cubes?), Grid3D, Points, Well with same patterns.)
  • Work both in template runs (not through ERT) and in FMU runs (through ERT).
    • Not a requirement to produce valid metadata in template runs
  • All interaction with ERT through existing functionality (workflows, forward_jobs, etc).
  • All interaction with RMS through existing functionality.

Should require workflow to be a dictionary, or handle it consistently if not

The workflow parameter can be given as anything, and will be directly inserted into outgoing metadata. However, it shall be an object, and there are also snippets in the code that suggest that an object (dict) is enforced.

String input is expected:

workflow: Optional[str] = None,

Dictionary is made:

self._meta_fmu["workflow"] = OrderedDict()

...but the actual behavior is that if a string is given as an argument, it is carried through fmu-dataio into the outgoing metadata, where it is still a string.

Needs some investigation.

Also, there is a typo here that should be fixed:

self._meta_fmu["workflow"]["refence"] = self._workflow

*reference
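
Besides fixing the typo, a minimal sketch of how a string input could be normalized to the expected object form (the key name reference follows the intent of the snippet above; this is not the actual fmu-dataio implementation):

from collections import OrderedDict

def normalize_workflow(workflow):
    """Accept either a string or a dict, always return an OrderedDict."""
    if isinstance(workflow, str):
        # Wrap plain strings so outgoing metadata always holds an object
        return OrderedDict(reference=workflow)
    if isinstance(workflow, dict):
        return OrderedDict(workflow)
    raise ValueError("workflow must be given as a string or a dictionary")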

Default behavior and standards

Purpose of the issue is to describe dependencies and agree on some standards. I assume some (all?) of these must be baked into this code base.

  • Position of template (model) metadata
  • Position of ensemble metadata
    • Behavior of reruns and restarts
  • Awareness of FMU realization
  • Behavior in template runs vs FMU runs

Prototype referred to: Johan Sverdrup implementation + existing Drogon temporary code.


Ensemble metadata
In prototype: <iter>/share/metadata/fmu_ensemble.yaml on /scratch, on ensemble level.
Example: /scratch/osefax/peesv/mycase/share/iter-0/metadata/fmu_ensemble.yaml
This file is copied into each realization during runtime by an ERT postsim hook workflow, placed under share/runinfo/fmu_ensemble.yaml, so that each realization is self-contained (avoiding that a realization reaches outside itself to get data).
In the prototype, ensemble metadata is handled by a completely separate script intended to be run through ERT workflows.

Suggestion 👇
Same as the prototype, except that ensemble metadata should be embedded into the same code base as other metadata and use the same core functionality.


Reruns
In prototype, if ensemble metadata already exists, a sanity check is performed (same case name, etc). If sanity check passes, the existing ensemble metadata is kept, but an item is appended to the runs block.

Suggestion 👇
Keep the general behavior as in the prototype. When ensemble metadata is first made, establish the metadata and put created as the first item in events. If metadata already exists, perform the sanity check, then overwrite the existing metadata. Keep the existing events but append an updated item. (Suggest using this behavior for all metadata!)

Location: the ensemble-level share folder. For an iteration, this would be /scratch/<asset>/<user>/<case>/<iter>/share/metadata/fmu_ensemble.yaml.


Template metadata
In prototype: fmu_template.yaml on the revision root.
Example: /resmod/ff/21.0.0/users/user/21.0.0_mycopy/fmu_template.yaml
This file is copied into the realization during runtime, keeping the same relative location.

Suggestion 👇
Embed into global_variables under a metadata tag.
Read it during export jobs from the already standard location of global_variables.


Awareness of realization
Each realization must be aware of its own realization id, as this is part of the metadata.
In prototype: The realization is injected into global_variables by an ERT FORWARD_JOB.
Example: fmu_realization: 41

Suggestion 👇
Continue to place it in global_variables by a FORWARD_JOB, but follow the proposed metadata structure:
{realization: {id: 41}}


Template runs vs FMU runs
Must work both when running the template directly (RMS GUI or other settings) and in FMU runs through ERT. In the current prototype, the following behavior is used:

  • In template runs, metadata without the fmu_ensemble block is produced. Realization tag is exported, with value null. These are not valid metadata according to standards, but can still be used for debugging.
  • In batch runs, ensemble metadata is required. The code checks that ensemble metadata is present if realization is set.
realization | ensemble metadata | Behavior | Description
X           |                   | FAIL     | Normal run, missing ensemble metadata
            | X                 | FAIL     | Normal run, missing realization
X           | X                 | OK       | Normal run
            |                   | OK       | Template run

Add parameters.txt instead of parameters.json

In case of DESIGN_KW but not GEN_KW, the parameters.json will not be generated.

Hence it is more failsafe to parse parameters.txt, as parameters.json only represents GEN_KW.

If neither GEN_KW nor DESIGN_KW is used, no parameters.* file will be made at all, which must also be handled. A sketch of parsing parameters.txt is shown below.
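
A minimal sketch of parsing parameters.txt, assuming it is a plain text file of whitespace-separated name/value pairs where names may contain colons, e.g. GROUP:NAME (the helper name is illustrative only):

from pathlib import Path

def read_parameters_txt(filename) -> dict:
    """Parse parameters.txt into a flat {name: value} dictionary."""
    parameters = {}
    for line in Path(filename).read_text().splitlines():
        if not line.strip():
            continue  # skip empty lines
        name, value = line.split()  # whitespace-separated name/value pair
        try:
            parameters[name] = float(value)
        except ValueError:
            parameters[name] = value  # keep non-numeric values as strings
    return parameters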

Read global_variables.yml inside fmu-dataio

To minimize footprint in other scripts, suggest that fmu-dataio incorporates the parsing of global_variables.yml. When the run context is known, the (default) location of global_variables should also be known.

Current:

CFG = read_yaml("global_variables.yml")
exp = ExportData(
    config=CFG
)

Proposed:

exp = ExportData()

Alternatively:

exp = ExportData(
    config_path="path/to/config"
)

Consider making parameters objects

Currently, parameters are key:value pairs. To prepare for possible future expansions, consider making each parameter an object, e.g.

parameters:
  SENSNAME:
    value: VALUE
    description: MyDescription

Aggregations

When installing in a live example, the need for handling aggregated data quickly becomes evident. This is slightly different from realization data.

  • fmu.realization is replaced by fmu.aggregation, which would need input.
  • element_id must be included

The current method is to use an existing realization as a template for creating aggregated metadata.

Consider this a placeholder. Needs to be refined and properly described.

tagname for filenames not always used

To be further described. Short story: When using filename tags, they are not always used in the outgoing filename. This may be related to spaces, commas, dots etc in filenames.

Able to run through ERT forward jobs

Currently, fmu-dataio is sensitive to the current working directory. It will not work when the working directory is the realization root, the config path, or other contexts that occur when running with ERT.

These are the known contexts that must be supported:

  • ERT forward job
  • ERT workflow
  • RMS job, within ERT forward job
  • RMS job, non-FMU run

Add upload functionality

Currently, fmu-dataio is responsible for exporting data + metadata to (local) disk. Other workflows and functions are responsible for uploading those same data to Sumo. A suggestion is to include upload functionality to fmu-dataio. This would have several benefits, such as:

  • Avoid having to install & maintain several packages
  • Combined with including metadata definitions, allow for better integration tests.
  • Make it easier to, in the future, skip the save-to-file part entirely

Current situation is the following dependency graph:

[data producing script] -> [disk] -> fmu-dataio -> [disk] -> fmu-sumo
"produce the file"                -> "produce metadata"   -> "upload"

We want to pull these together as much as possible. For several scripts, particularly within RMS, fmu-dataio is embedded as a dependency in the data-producing script, avoiding the extra [disk] layer. When it comes to upload, the natural pattern is to pull fmu-sumo in as a dependency of fmu-dataio.

Comment: Most data-producing scripts in FMU should not have knowledge of Sumo. Knowledge of Sumo should sit within fmu-dataio.

The role of fmu-dataio is to export (and in the future, import) data. It is not specific to "store data on local disk". Hence storing data in Sumo is conceptually similar to storing data on local disk (or any other place).

Requirements

  • Excessive network traffic must be avoided. E.g. overhead such as getting the Sumo-ID must not be repeated for every file in one batch.
  • Duplicate uploads must be avoided
  • Local export (to disk) must always be possible, to account for data being used further in the workflow
  • Template runs must still be possible.
  • Non-upload must be possible, e.g. when user is not onboarded to Sumo (yet)
  • Avoid API change

Design options and considerations:

Suggest embedding upload functionality into dataio.py and invoking it through arguments to the existing .export method, i.e.:

exp = ExportData(...)
exp.export(obj, sumo=True, local=True)

...where "local" defaults to True and "sumo" defaults to False from day 0. Both can be True.

Further, to avoid excessive network traffic, batch upload is preferred:

E.g. this would be bad:

for obj in my_data_objects:
    exp = ExportData(...)
    exp.export(obj, sumo=True)

...while this would be better:

exp = ExportData(sumo=True) # authentication happens here (once)
for obj in my_data_objects:
    exp.export(obj, sumo=True)

To be further refined!

Exporting a surface with only NaNs causes metadata format failure

When exporting a surface with NaNs only, the bbox.zmin and bbox.zmax in the metadata get the illegal value .nan.

Expected behaviour:
Exporting only NaNs feels weird but should be allowed, as there are multiple possible use cases. However, the metadata should be valid. Not sure what the best behavior is. Possibly, zmin/zmax should be null in these cases; a sketch of that is included after the reproduction example below. It's still conceptually tricky.

Reproducing:

# python

from pathlib import Path
from xtgeo import RegularSurface
from fmu.dataio import ExportData
import numpy as np

# make surface with only undefined values
nans = np.empty((33, 50))
nans[:] = np.nan
surf = RegularSurface(
    ncol=33, nrow=50, xori=34522.22, yori=6433231.21,
    xinc=25.0, yinc=25.0, rotation=30,
    values=nans,
)

exp = ExportData(
    content="fluid_contact",
    unit="m",
    vertical_domain={"depth": "msl"},
    timedata=None,
    is_prediction=True,
    is_observation=False,
)

exp._pwd = Path("./tmp/tmp")
exp.to_file(surf)

# Results in...
#  bbox:
#    zmin: .nan
#    zmax: .nan
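
A hedged sketch of how zmin/zmax could be emitted as null when every value is undefined, assuming the bounding box is computed from the surface values array (this illustrates the suggested behavior, it is not the current fmu-dataio code):

import numpy as np

def compute_zrange(values):
    """Return (zmin, zmax), or (None, None) if all values are undefined."""
    masked = np.ma.masked_invalid(values)
    if masked.count() == 0:
        # All values are NaN/masked: emit null instead of .nan in the metadata
        return None, None
    return float(masked.min()), float(masked.max())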

To be refined: UUID for iteration and realization

Based on discussions with Sumo and REP. Include a unique ID on the realization and iteration level. The purpose of this is to enable future use cases. Technically, uniqueness can be achieved by any client by combining case.uuid, iteration.id and realization.id, but if this is injected into other database systems, a proper uuid4 will be necessary and beneficial. A sketch of one possible derivation follows the example below.

case:
  uuid: 36e30888-7dc9-4d91-a525-8bf8e0e67cce  <- Generated directly, perhaps more sophisticated in the future
iteration:
  id: 0
  name: "iter-0"
  uuid: 8e5bd5ed-636d-4d38-bcd8-2e538154beda <- Valid uuid4, hash of case.uuid + iteration.id
realization:
  id: 0
  name: "realization-0"
  uuid: 59c69f7d-ce4d-45cb-9da5-f3d13a2fc1b8 <- Valid uuid4, hash of case.uuid + iteration.id + realization.id
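
A minimal sketch of how such deterministic child UUIDs could be derived, here using name-based uuid5 hashing of the case uuid plus the iteration/realization names (the exact hashing scheme is not decided in this issue, and the resulting values will differ from the example above):

import uuid

case_uuid = uuid.UUID("36e30888-7dc9-4d91-a525-8bf8e0e67cce")

# Derive stable, valid UUIDs for iteration and realization from the case uuid
iteration_uuid = uuid.uuid5(case_uuid, "iter-0")
realization_uuid = uuid.uuid5(case_uuid, "iter-0/realization-0")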

Use "surface" as class, not "regularsurface"

Currently, regularsurface is used as the class. It used to be surface?

surface seems better; the layout can then be handled with the layout attribute. Logic in various places (clients) is keyed on the class attribute, for which the type of surface is not relevant. Having regularsurface as a class implies also having (at least) structuredsurface (or similar) as another class, which for many clients will result in having to combine these two different classes.

Suggest using surface consistently, irrespective of layout and/or format.

Handling of UNDEF

Outgoing metadata for tables gets the string value Nan for the undef field.
Expected behavior: either a valid number (-999.25 or similar) or nothing.

Required content extra is not enforced

fluid_contact has additional required information, but passing just the string fluid_contact as content will not fail.
Wanted behavior: when extra content information is required, a ValidationError should be raised if it is missing.

CONTENTS_REQUIRED = {

Currently, in ExportItem._data_process_content, the validation is only triggered if useextra is defined. It is undefined by default and is only defined if a dictionary is passed. Hence, if a string is passed, the validation will never trigger; see the sketch after the code references below.

def _data_process_content(self):
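
A hedged sketch of how the requirement could be enforced also for plain string input, mirroring the CONTENTS_REQUIRED mapping referenced above (the mapping content and error type here are illustrative, not the actual fmu-dataio implementation):

# Illustrative only; maps content name -> keys that must be given as extra input
CONTENTS_REQUIRED = {"fluid_contact": ["contact"], "field_outline": ["contact"]}

class ValidationError(ValueError):
    """Raised when required extra content information is missing."""

def validate_content(content, extra=None):
    """Fail also when content is given as a plain string and extras are required."""
    if content in CONTENTS_REQUIRED:
        missing = [key for key in CONTENTS_REQUIRED[content] if not extra or key not in extra]
        if missing:
            raise ValidationError(f"content '{content}' requires extra input: {missing}")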

Should JSON schema be included into fmu.dataio?

Currently, JSON schema definitions + examples are in a separate repo. However, validation is being included into fmu.dataio and some tests involve validation. This means there is a risk of the two getting out of sync.

We could consider merging the two (e.g. putting the JSON schema definitions into this repo). This could become complicated with respect to revisions etc., so it must be discussed.

Submit total number of realizations for a case with the case metadata

The total number of realizations in a case is not currently submitted, as this can change over the course of a case's evolution. E.g. a user might run 100 realizations, then decide to add 100 more. This may be a theoretical problem (for now), but still one that has been taken into account in the metadata definitions.

However, there are requests to include the expected number of realizations for a specific case. It would be possible to include the realization count as an input argument to the case-initializing methods.

Allow for export to subfolder or possibly override datatype

Currently, default file locations are used (share/results/<datatype>). This can get confusing when one datatype is represented by another. E.g. "polygons" is exported to share/results/polygons when an XTgeo polygon object is used, but when multiple polygons are gathered in the same file, they go to share/results/table.

Suggest allowing an override of where to export, so that e.g. a pandas DataFrame can also be exported to polygons.

Also, with lots of exports this can become messy, as some local interaction still happens with the exported files. Consider allowing output_subfolder as an input argument.

E.g. if output_subfolder is given, export will happen to default/output_subfolder.
Example:
output_subfolder="MyFolder" --> Export folder: share/results/<datatype>/myfolder
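
A hedged sketch of how the proposed argument could look in use (output_subfolder is the name proposed in this issue, not an existing argument):

exp = ExportData(
    content="polygons",
    output_subfolder="MyFolder",  # proposed: export goes to share/results/<datatype>/myfolder
)
exp.to_file(df)  # df could be a pandas DataFrame representing polygons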

Use cases for this:

  • Post-process clean-up scripts are based on folder, not filenames
  • Find-all mechanisms in visualisation functions or aggregation functions are based on "take all files in this folder".

Visual settings (display)

Visual settings are included in the metadata schema, but not explicitly set by fmu-dataio. Legacy usage of metadata rests on a lot of visual settings being set while generating data, and in the JS workflow defaults have been included in global_variables. This pattern could possibly be continued in fmu-dataio, combined with the option of giving visual settings as an input argument.

There is a larger discussion here, however, relating to the responsibility of the data producer vs the data consumer. I think it might be a better (?) pattern to let clients decide visual settings as much as possible. That said, however, visual settings could be the result of domain-specific rules and know-how which could be difficult to convey if not explicitly part of the metadata.

Suggestion, for discussion:
Create a separate VisualSettings object to hold (and calculate) visual settings and pass that to fmu-dataio.ExportData as an argument. If not passed, a skeleton version with default values is included. Experience from JS is that this can quickly escalate, which is why I think keeping it as a separate module is wise. Also, given that patterns might change, I think keeping the interface with the rest of fmu-dataio slim is smart.

Confusing API for ExportData.to_file()

In the frontpage example:

def export_some_surface():
    srf = xtgeo.surface_from_file("top_of_some.gri")

    exp = ExportData(
        config=CFG,
        content="depth",
        unit="m",
        vertical_domain={"depth": "msl"},
        timedata=None,
        is_prediction=True,
        is_observation=False,
        tagname="Some Descr",
        verbosity="WARNING",
    )

    exp.to_file(srf)

the usage of exp.to_file clashes with similar functions such as pandas.DataFrame.to_csv, where the latter takes the filename to write to as its first positional argument, while exp.to_file() takes the data to export as its argument. This might lead to confusion.

Possible solutions:

  • Keep as is
  • Add the data to export into the exp object at initialization, or as a separate function call, and potentially fail if data is passed on to to_file().
  • Force named arguments to to_file() so that you have to write e.g. exp.to_file(data=srf)
  • ?

access_ssdl argument not used

The argument access_ssdl is valid, but giving it to ExportData has no effect. Default values from global_variables are still used.

Failing to parse parameters.txt for live example

Tail of stack:

RuntimeError: Unexpected structure of parameters.txt, line is: :::::::INIT_FILES:AQUA_TRANS_FACTOR_NORM	-0.89905

It fails on the first line of the file, which looks like this:

       INIT_FILES:AQUA_TRANS_FACTOR_NORM	-0.89905

This seems to be unrelated to the first line, as it still fails when putting in a dummy entry:
File:
TST 1
Tail of stack:

RuntimeError: Unexpected structure of parameters.txt, line is: :::::::::::::::::::::::::::::::::::::TST::::1

Add to PyPI

Make a workflow for PyPI uploading when a release is tagged.
