
fmu-ensemble's Introduction


Introduction to fmu.ensemble

FMU Ensemble is a Python module for handling simulation ensembles originating from an FMU (Fast Model Update) workflow.

For documentation, see the github pages for this repository.

Ensembles consist of realizations. Realizations consist of (input and) output from their associated jobs stored in text or binary files. Selected file formats (text and binary) are supported.

This module helps you handle ensembles and realizations (and their associated data) as Python objects, thereby facilitating the use of other Python visualization modules like webviz and plotly, or interactive usage in IPython/Jupyter.

If run as a post-workflow in Ert, a simple script using this library can replace and extend the existing CSV_EXPORT1 workflow.

This software is released under GPL v3.0.

fmu-ensemble's People

Contributors

anders-kiaer, asnyv, berland, bkhegstad, dafeda, dansava, eivindjahren, fahaddilib, jcrivenaes, jonathan-eq, larsevj


fmu-ensemble's Issues

Easier to work with history vector observations

  • Add doc on how to load observations from a YAML string; easier in notebooks compared to inputting the dict-list-dict structure.
  • Guess the history vector automatically; it should not need to be required for smryh.
  • Fails hard on non-existing summary keys; should perhaps just warn about them.
  • Warnings from the realization object should include the realization index.
  • Allow a "global" time_index to be specified at the top level for smryh?
  • Incomplete docs; no mention of scalar values.

Default log level

Ensure that default log-level is WARNING.

If logging is not set, not even ERROR messages are printed to the console.
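A minimal sketch of how a consumer could ensure WARNING-level output today; the logger name "fmu.ensemble" is an assumption about the library's logger hierarchy, and the pattern is the same for any library logger:

```python
import logging

# Assumed logger name; fmu.ensemble's actual logger hierarchy may differ.
logger = logging.getLogger("fmu.ensemble")

# Attach a handler (if none is configured) and set WARNING as the default
# level, so that WARNING and ERROR messages reach the console:
if not logger.handlers:
    logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.WARNING)

logger.warning("this is printed to stderr")
logger.debug("this is suppressed")
```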

Avoid recursing into restart UNSMRY files

Parsing data from UNSMRY files which are restarted from others is supported by libecl, but is sometimes a minefield.

EclSum supports setting include_restart=False to avoid this recursion, and this should be supported from fmu-ensemble.

Full support for timestamps in observations module

The observation module has been made with date (day) accuracy, so observations timestamped with a full datetime will not work.

This is relevant for DST well tests.

Task is to add support for it, and with extensive tests.

See also #82

EnsembleSet initialization bugs

When standing in the directory containing the realizations directories:

EnsembleSet(frompath='.') does not work, zero realizations.
EnsembleSet(name='foo', frompath='.') does work.
EnsembleSet(frompath='.', name='foo') does not work.

Also check whether frompath='.' can be default in some situations, so all that is needed is
EnsembleSet() when you are in the correct place.

get_smry, time_index

Make it possible to use a specific date with get_smry.

Example:
date='2019-01-01'
smry = ens.get_smry(column_keys=['GPTH:'], time_index=date)

VirtualEnsemble manifest seems to have minor bugs

Line 75 should probably read self._manifest = manifest. Check why this has not been caught by tests.

Defaulting the manifest does not work properly; to_virtual() on an EnsembleCombination will, for example, issue a manifest warning which is not relevant.

Make it possible to turn off auto-discovery of Eclipse UNSMRY files

There can be situations where automatic discovery of UNSMRY files is not wanted. If there are multiple Eclipse runs in eclipse/model, and the user wants to control which one is used, find_files() is the recommended practice. But, if the wanted Eclipse run has crashed in one realization, the auto-discovery might kick in and discover another run, which in that case would be erroneous.

ensemblecombination.get_df() index guessing

It should be possible to override the guessing of indices in ensemblecombination.get_df(). In particular, there will be situations where ZONE and REGION have different names.

Also applies to realizationcombination.

EnsembleCombinations lack a VirtualEnsemble API

EnsembleCombination objects should act as VirtualEnsemble objects, and implement functions like agg() so the user does not have to call to_virtual().

Calling .to_virtual() explicitly could still be recommended when the object is to be reused.

Consider changing default behaviour for missing data in Combinations

RealizationCombination and EnsembleCombination can do linear combination of dataframes. When these are indexed by a DATE column, it will only combine for DATEs existing in both datasets, and drop the rest.

For summary data, VirtualEnsemble's get_smry() will extrapolate any summary data correctly (zero for rate vectors, constant for cumulative vectors), meaning it is technically possible to combine any realizations' summary data (even with no overlapping DATE). This is relevant in situations where the end-date of a simulation is variable by design. Right now this can probably be worked around by providing a list of datetimes, or possibly an end_date, but that would require custom coding.

If the functionality is changed to always extrapolate, there will be side-effects when the end-date varies because of errors. It probably makes more sense to put the responsibility on the user for filtering out bad simulations.
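The current intersect-and-drop behaviour can be illustrated with plain pandas (a stand-in for the combination internals, not fmu.ensemble code):

```python
import pandas as pd

# Two realizations' summary data with only one overlapping DATE:
real_a = pd.DataFrame(
    {"DATE": pd.to_datetime(["2020-01-01", "2020-02-01"]), "FOPT": [100.0, 200.0]}
)
real_b = pd.DataFrame(
    {"DATE": pd.to_datetime(["2020-02-01", "2020-03-01"]), "FOPT": [150.0, 250.0]}
)

# An inner join on DATE keeps only the shared date, silently dropping
# 2020-01-01 and 2020-03-01 from the combination:
delta = real_a.merge(real_b, on="DATE", suffixes=("_a", "_b"), how="inner")
delta["FOPT_DIFF"] = delta["FOPT_a"] - delta["FOPT_b"]
```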

@asnyv

Add dictionary for metadata to ensembles

Some applications need to associate a dictionary of metadata to ensembles.

This probably has to be a class member, typically initialized from a yaml file.

Should __init__ support it, or should we require an extra function call to set it?

It requires support in virtual_ensembles, and in to/from_disk.

Should it be reset to None in EnsembleCombinations?

Should there be a default filename that can be looked for? A filename in use is share/runinfo/runinfo.yaml relative to the ensemble root.

Method to get EnsembleSet statistics.

The method would return a dataframe containing the summary statistics of an EnsembleSet, in a form similar to EnsembleSet.get_smry().

It would wrap Ensemble.get_smry_stats() and aggregate the results into a concatenated dataframe including a column "ENSEMBLE" identifying the ensemble the data came from.

Input: EnsembleSet
Output: dataframe of aggregated summary-statistics of individual ensembles (not statistics of the ensemble-set as a whole).
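The aggregation step could be sketched as follows; the input frames and column names are illustrative assumptions, not the actual shape of Ensemble.get_smry_stats() output:

```python
import pandas as pd

# Hypothetical per-ensemble statistics frames:
stats_by_ensemble = {
    "iter-0": pd.DataFrame({"DATE": ["2020-01-01"], "FOPT_mean": [100.0]}),
    "iter-1": pd.DataFrame({"DATE": ["2020-01-01"], "FOPT_mean": [120.0]}),
}

# Tag each block with its ensemble name, then concatenate:
frames = []
for ensname, dframe in stats_by_ensemble.items():
    dframe = dframe.copy()
    dframe["ENSEMBLE"] = ensname
    frames.append(dframe)
aggregated = pd.concat(frames, ignore_index=True)
```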

load_scalar() is not Python3 compatible

=================================== FAILURES ===================================
_____________________________ test_reek001_scalars _____________________________
    def test_reek001_scalars():
        """Test import of scalar values from files
    
        Files with scalar values can contain numerics or strings,
        or be empty."""
    
        if "__file__" in globals():
            # Easen up copying test code into interactive sessions
            testdir = os.path.dirname(os.path.abspath(__file__))
        else:
            testdir = os.path.abspath(".")
    
        reekensemble = ScratchEnsemble(
            "reektest", testdir + "/data/testensemble-reek001/" + "realization-*/iter-0"
        )
    
        assert "OK" in reekensemble.keys()
        assert isinstance(reekensemble.get_df("OK"), pd.DataFrame)
        assert len(reekensemble.get_df("OK")) == 5
    
        # One of the npv.txt files contains the string "error!"
        reekensemble.load_scalar("npv.txt")
        npv = reekensemble.get_df("npv.txt")
        assert isinstance(npv, pd.DataFrame)
        assert "REAL" in npv
        assert "npv.txt" in npv  # filename is the column name
>       assert len(npv) == 5
E       assert 1 == 5
E        +  where 1 = len(   REAL npv.txt\n0     4  error!)
tests/test_ensemble.py:247: AssertionError

Improve observation support

Possible features for the Observation class:

  • to_yaml() implementation
  • __repr__() so that the content can be explored from the IPython shell
  • mismatch() function should support EnsembleSets.
  • to_ert_observations() to dump an observation file that ERT can load and condition to
  • "to_resinsight_csv()" to dump the observation CSV file that ResInsight can import
  • Be able to compare to data from the RFT file
  • Make an Observations.misfit() that returns one value per realization.
  • Support for observation correlations in the misfit calculation
  • Verify the misfit calculation: is it identical to ERT's, can it be made identical, and should it be?
  • Decide on sensible column names for the returned DataFrame (L1 and L2 might not be intuitive for many people)
  • Ability to instantiate an observation (VirtualObservation??) from a VirtualRealization (implies ScratchRealization); this would allow "finding the realization closest to mean FOPT"
  • Ignore missing data
  • More logging
  • get_smry() should return empty dataframes when asked for non-existing keys, not raise ValueError
  • Decide if extrapolation through get_smry() is a sensible thing to do. If it is not, it requires some changes to the code.

Solve DeprecationWarning on collections

fmu-ensemble/src/fmu/ensemble/realization.py:1759: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    if isinstance(value, collections.MutableMapping):
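The fix is to import the ABCs from collections.abc, with a fallback for Python 2 if that still needs to be supported:

```python
# collections.abc has held the ABCs since Python 3.3; importing them
# from collections stopped working in Python 3.10.
try:
    from collections.abc import MutableMapping  # Python 3.3+
except ImportError:  # pragma: no cover
    from collections import MutableMapping  # Python 2 fallback

# isinstance checks behave exactly as before:
value = {"a": 1}
is_mapping = isinstance(value, MutableMapping)
```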

Observations mismatch fails when one realization is missing the UNSMRY file.

Using the HistoryMatch container in webviz creates an EnsembleSet and computes mismatches from observations. fmu.ensemble.Observations.mismatch fails when a realization has failed (does not have a summary file).

https://github.com/equinor/webviz-subsurface/blob/2d90ffc138ace636c051f3a42e52dc12af586739/webviz_subsurface/datainput/_history_match.py#L30

)[obsunit["key"]].values[0]

Traceback:

File "/.../venv/lib/python3.7/site-packages/webviz_subsurface/datainput/_history_match.py", line 30, in extract_mismatch
    .mismatch(ens_data)
  File "/.../venv/lib/python3.7/site-packages/fmu/ensemble/observations.py", line 116, in mismatch
    mismatches[(ensname, realidx)] = self._realization_mismatch(real)
  File "/.../venv/lib/python3.7/site-packages/fmu/ensemble/observations.py", line 305, in _realization_mismatch
    )[obsunit["key"]].values[0]
  File "/.../venv/lib/python3.7/site-packages/pandas/core/frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/.../venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2658, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'WBP4:SOME_WELL'

Using PR #3 does not help and gives this Traceback:

  File "/.../venv/lib/python3.7/site-packages/webviz_subsurface/datainput/_history_match.py", line 30, in extract_mismatch
    .mismatch(ens_data)
  File "/.../venv/lib/python3.7/site-packages/fmu/ensemble/observations.py", line 122, in mismatch
    mismatches[(ensname, realidx)] = self._realization_mismatch(real)
  File "/.../venv/lib/python3.7/site-packages/fmu/ensemble/observations.py", line 342, in _realization_mismatch
    )[obsunit["key"]].values[0]
  File "/.../venv/lib/python3.7/site-packages/pandas/core/frame.py", line 2927, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/.../venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2658, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'WBP4:SOME_WELL'

Use pytest tmpdir fixture, allowing parallel testing

test_ensembleset.py at least is not parallelizable, as it can fail in parallel while working in sequential mode. This is probably due to race conditions on directories being written to. Fix by using tmpdir fixtures in pytest, and verify that pytest -n 20 never fails.

Align get_smry and load_smry

For a ScratchEnsemble, get_smry() and load_smry() may differ in the dates returned: get_smry() first builds a list of dates to obtain data for and then asks every realization for data at those dates, while load_smry() asks each realization independently. This will only happen if the realizations' summary data have different end dates.

It is not given that this difference is a bug rather than a feature, so perhaps it only needs to be documented.

pyarrow dependency

pyarrow is not imported before to_disk() is called, but it is listed as a dependency.

pyarrow is not in komodo, so to_disk() will probably fail in such an environment.

Consider making pyarrow an optional dependency, and have to_disk() fall back to writing only CSV files when the import fails.
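A common optional-dependency pattern would probe the import once at module load; write_frame() below is a hypothetical helper to show the fallback, not the fmu.ensemble API:

```python
# Probe for pyarrow once; downstream code branches on the flag.
try:
    import pyarrow  # noqa: F401
    HAVE_PYARROW = True
except ImportError:
    HAVE_PYARROW = False

def write_frame(dframe, basepath):
    """Write a dataframe, preferring parquet when pyarrow is present."""
    if HAVE_PYARROW:
        dframe.to_parquet(basepath + ".parquet")
    else:
        dframe.to_csv(basepath + ".csv", index=False)
```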

Pandas 0.25 compatibility, Py3.6

With Python 3.6 and Pandas 0.25, a test fails. Works fine with Python 3.6 and Pandas 0.24.

tests/test_ensemble_agg.py:44: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../virtualenv/python3.6.7/lib/python3.6/site-packages/fmu/ensemble/ensemble.py:1162: in agg
    aggregated = aggobject.quantile(quantile / 100.0)
../../../virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/groupby/groupby.py:1908: in quantile
    interpolation=interpolation,
../../../virtualenv/python3.6.7/lib/python3.6/site-packages/pandas/core/groupby/groupby.py:2238: in _get_cythonized_result
    vals, inferences = pre_processing(vals)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
vals = array(['DESIGN2PARAMS', 'DESIGN_KW', 'DESIGN_KW', 'DESIGN_KW',
       'DESIGN_KW', 'DESIGN_KW', 'MAKE_DIRECTORY', 'COP...',
       'RMS_BATCH', 'GENERATE_RELPERM', 'GENERATE_RELPERM', 'INCLUDE_PC',
       'ECLIPSE100_2014.2'], dtype=object)
    def pre_processor(vals: np.ndarray) -> Tuple[np.ndarray, Optional[Type]]:
        if is_object_dtype(vals):
            raise TypeError(
>               "'quantile' cannot be performed against " "'object' dtypes!"
            )
E           TypeError: 'quantile' cannot be performed against 'object' dtypes!

Parameter-merging per realization

Merging an ensemble-wide dataframe (e.g. summary data) with the ensemble-wide parameters dataframe is a common task. Calling pd.merge on these two dataframes from the outside is probably inefficient compared to merging individually per realization.

Perhaps an option can be added to get_df() to perform merging with some dataset at the realization level. This would probably speed up standard operations, and also make it possible to utilize multiprocessing in this operation.
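A sketch of the per-realization approach; the function and input shapes are hypothetical, not the fmu.ensemble API. Since parameters are constant within a realization, they can be assigned as columns instead of merged:

```python
import pandas as pd

def merge_per_realization(smry_by_real, params_by_real):
    """Attach per-realization parameters to each summary frame, then concat."""
    frames = []
    for real, smry in smry_by_real.items():
        merged = smry.copy()
        for key, value in params_by_real[real].items():
            merged[key] = value  # constant within one realization
        merged["REAL"] = real
        frames.append(merged)
    return pd.concat(frames, ignore_index=True)

smry = {0: pd.DataFrame({"FOPT": [1.0, 2.0]}), 1: pd.DataFrame({"FOPT": [3.0, 4.0]})}
params = {0: {"PORO": 0.2}, 1: {"PORO": 0.3}}
merged = merge_per_realization(smry, params)
```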

Multiprocessing

Operations over an ensemble are trivially parallelizable.

We should utilize Python multiprocessing for this.

multiprocessing is what should be used, as multithreading would suffer from the GIL.

This is probably trivial for ensemble.get_smry(), but not so trivial for ensemble.from_smry(), as we need to populate each realization object with smry data in the parent process' memory space.

Maybe ensemble.from_smry() should call realization.get_smry() with multiprocessing, and then the ensemble object (holding the master process) populates each realization's self.data['unsmry-<something>'].

We must ensure CTRL-C works, which is trickier with multiprocessing.

See this: https://stackoverflow.com/questions/11312525/catch-ctrlc-sigint-and-exit-multiprocesses-gracefully-in-python
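The pattern from that discussion can be sketched as below; load_one_realization() is a hypothetical stand-in for realization.get_smry(), and the "fork" start method is assumed (POSIX):

```python
import multiprocessing
import signal

def _init_worker():
    # Workers ignore SIGINT so CTRL-C only reaches the parent process,
    # which can then terminate the pool cleanly.
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def load_one_realization(realdir):
    # Stand-in for realization.get_smry(); just echoes its input here.
    return realdir.upper()

# "fork" is assumed here; on Windows ("spawn"), worker functions must be
# importable from a module and guarded by ``if __name__ == "__main__":``.
ctx = multiprocessing.get_context("fork")
with ctx.Pool(2, initializer=_init_worker) as pool:
    try:
        results = pool.map(load_one_realization, ["real-0", "real-1"])
    except KeyboardInterrupt:
        pool.terminate()
        raise
```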

When this is in place, we should also be able to skip issues when libecl is core-dumping due to a difficult UNSMRY-file.

Right now, your Python session will die if libecl crashes on rough data.

Reading and writing VirtualEnsembles to disk

VirtualEnsembles should be dumpable and initializable to/from disk and cloud. This is partially implemented. This issue picks up where #7 left off.

ScratchEnsembles are not disk-dumpable, as they already are on disk. ScratchEnsembles must be virtualized before dumping to disk.

We need

  • Write all internalized data to disk as CSV
  • Figure out what else needs to be dumped. The files dataframe?
  • Option to choose another file format for CSV data, e.g. HDF5 or JSON?
  • Support internalized scalar values (#32)
  • Be able to initialize a VirtualEnsemble from disk.
  • Be tolerant of ensembles that were written to disk with older versions of fmu.ensemble (not relevant today, but it will be!)
  • Decide if we can have CSV and HDF5 files with identical data lying next to each other. If so, decide authority: what happens if the CSV file gets updated and the HDF5 file not?
  • Test/decide whether all data can or should be merged into one HDF5 file. Any upside compared to the multiple-files-in-a-directory concept? What about JSON?
  • Decide on the API. Should it be to_disk() and from_disk()?
  • Be compatible with a cloud solution. Should we have different functions for this than to_disk()?
  • The files dataframe must have a STOREDRELATIVEPATH. FULLPATH should perhaps be renamed to ORIGINALFULLPATH.
  • Add yaml per file to the files dataframe as additional columns
  • Support lazy loading of dataframes from disk in from_disk()
  • Dump a CSV file with an index of the files dumped, except "discovered files"

Write test for summary handling with uneven time-scale lengths

There should be a test setup to verify correctness when one realization has failed, say in the middle of the schedule period. Perturb a schedule file to end prematurely, run it, and save the UNSMRY file with a different filename that can be injected temporarily by the test-code.

Test that we can filter out by date the realization that failed.

load_smry() on that realization should not have any dates past the crash point, for any time_index argument.

get_smry() at ensemble level should behave similarly, with a differing last date per realization.

The behaviour of get_smry_stats() is undefined. Uneven DATE ranges per realization must be padded when realizations have not been excluded, as it is not given that a simulation ending earlier represents an error rather than intent. An option could be added to get_smry_stats() to require the same end-date for all realizations.

Delta profiles should possibly exclude profiles with differing DATE columns.
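The padding rule (zero for rates, constant for cumulatives) can be sketched with plain pandas; this is an illustration of the rule, not the library's implementation:

```python
import pandas as pd

# A realization that crashed after February, reindexed onto the full
# ensemble date range (monthly):
full_index = pd.date_range("2020-01-01", "2020-04-01", freq="MS")
short = pd.DataFrame(
    {"FOPR": [10.0, 12.0], "FOPT": [100.0, 200.0]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01"]),
)

padded = short.reindex(full_index)
padded["FOPR"] = padded["FOPR"].fillna(0.0)  # rate vector: pad with zero
padded["FOPT"] = padded["FOPT"].ffill()      # cumulative vector: hold constant
```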

realization.load_status() can crash on input

Stack-trace:

The script 'ExternalErtScript' caused an error while running:
Traceback (most recent call last):
 File "/pr../wf_well_volumes.py", line 458, in <module>
   main()
 File "/projec.......wf_well_volumes.py", line 354, in main
   ens             = ensemble.ScratchEnsemble('ens', args.scratch_dir+'/realization-*')
 File "/project/res/lib/python2.7/site-packages/fmu/ensemble/ensemble.py", line 125, in __init__
   paths, realidxregexp, autodiscovery=autodiscovery
 File "/project/res/lib/python2.7/site-packages/fmu/ensemble/ensemble.py", line 228, in add_realizations
   realdir, realidxregexp=realidxregexp, autodiscovery=autodiscovery
 File "/project/res/lib/python2.7/site-packages/fmu/ensemble/realization.py", line 137, in __init__
   self.load_status()
 File "/project/res/lib/python2.7/site-packages/fmu/ensemble/realization.py", line 463, in load_status
   hms = list(map(int, jobrow["STARTTIME"].split(":")))
ValueError: invalid literal for int() with base 10: '1/Process'
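A defensive parser would tolerate such malformed fields; parse_starttime() below is a hypothetical helper illustrating the idea, not the library's actual fix:

```python
def parse_starttime(field):
    """Parse an 'HH:MM:SS' STATUS-file field into seconds since midnight.

    Hypothetical helper: returns None for malformed fields (like
    '1/Process') instead of raising ValueError, so one bad line does
    not abort ensemble loading.
    """
    parts = field.split(":")
    if len(parts) != 3:
        return None
    try:
        hours, minutes, seconds = map(int, parts)
    except ValueError:
        return None
    return hours * 3600 + minutes * 60 + seconds
```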

Support status.json

The STATUS-file parsing should be replaced by parsing of status.json whenever that file is available.

status.json probably appeared in Ert 2.3, ca November 2018.

Current parsing of the STATUS-file has unavoidable problems with jobs lasting more than 24 hours.

EnsembleCombinations are really slow

Ensemble arithmetic works, but is most likely suffering from some pythonic mistakes, as the test code test_ensemblecombination.py takes excessive time. It is not unlikely that something bad is going on with garbage collection.
