
justingosses / predictatops

58 stars · 9 watchers · 17 forks · 93.3 MB

Stratigraphic pick prediction via supervised machine-learning

Home Page: https://justingosses.github.io/predictatops/html/index.html

License: MIT License

Languages: Makefile 0.97% · Python 93.84% · Jupyter Notebook 5.18%
Topics: geology, stratigraphy, geoscience, machine-learning, hackathon-project, well-logs, dataunderground, athabasca, athabasca-preprocessed

predictatops's Introduction

predictatops

Code for stratigraphic pick prediction via supervised machine-learning

[logo image: Yale-Peabody-Triceratops-004Trp]

[badges: DOI · License: MIT]

THIS REPOSITORY HAS BEEN ARCHIVED TO SIGNIFY THERE WILL NOT BE ADDITIONAL WORK. HOWEVER, IT HAS ALWAYS BEEN A PROOF OF CONCEPT OF AN APPROACH RATHER THAN A TOOL, SO YOUR USE OF IT SHOULD NOT REALLY CHANGE. YOU CAN STILL STAR OR FORK IT.

Status: Runs and is ready for others to try. This code project is most useful as a working proof-of-concept. It is not optimized for plug-and-play use or for use as a dependency. Updated to v0.0.4-alpha October 26th, 2019. Updates to dependencies are done, but not frequently. NOTE: Running in a standard Google Colab notebook may fail during model training because the memory required exceeds the default initial amount of RAM.

Current best RMSE on the Top McMurray surface is 6.6 meters.

Related Content

The docs provide additional information beyond this README.

This code is the subject of an abstract submitted to the AAPG ACE convention in 2019.

The slides I presented at AAPG ACE 2019 are available in PDF form. They give an introduction to the theory and thought process behind Predictatops.

Development was originally in the MannvilleGroup_Strat_Hackathon repo but has moved here as the code gets cleaned and modularized. This project is under active development, and a few portions of the code still exist only in the MannvilleGroup_Strat_Hackathon repo at this time. This is a nights-and-weekends side project, but it will continue to be developed by the main developer.

A more non-coder-friendly description of the work can be found in this blog post.

Philosophy

In human-generated stratigraphic correlation there is often talk of lithostratigraphy vs. chronostratigraphy. We propose there is a loose analogy between that distinction and the different methods of computer-assisted stratigraphy. Some past efforts, which work very well under certain circumstances, are similar to lithostratigraphy in what they accomplish: they match curve patterns between neighboring wells and rely on the assumption that changes in lithology, i.e. curve shapes, are equivalent to stratigraphy.

Other papers attempt to use code to correlate well logs on the assumption that stratigraphic surfaces have a mathematical or pattern basis that can be teased out of individual logs. Although recent papers seem to do better with this type of approach, no code was released, and the earlier efforts had problems that appear at least in part related to their assumption that stratigraphic changes have similar expression across large spatial areas.

In contrast to lithostratigraphy, chronostratigraphy assumes that lithology maps to facies belts that shift gradually in space over time, so lithology is not directly correlated with time: two wells with similar lithology patterns can sit in different time packages. When not otherwise constrained by biostratigraphy, chemostratigraphy, or radiometric dating, traditional chronostratigraphy relies on models of how facies belts should change in space.

Instead of relying on stratigraphic models, this project proposes that known picks can define the spatial distribution of, and variance in, well log curve patterns, which are then used to predict picks in new wells. The project focuses on creating programmatic features and operations that mimic the low-level observations of a human geologist, then progressively builds them into the higher-order clustering of patterns across many wells that a human geologist would otherwise perform.
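As a concrete illustration of what those "low-level observations" might look like in code, here is a minimal sketch of rolling-window features computed on a single hypothetical log curve; the column names and window sizes are assumptions, not the repository's actual feature set:

```python
import pandas as pd

def add_local_curve_features(df, curve="GR", windows=(5, 11, 21)):
    """Rolling-window statistics on one log curve, meant to mimic the
    small-scale observations a geologist makes while scanning a well log.
    `df` is assumed to have one row per depth step; `curve` is hypothetical."""
    for w in windows:
        rolled = df[curve].rolling(window=w, center=True, min_periods=1)
        df[f"{curve}_mean_{w}"] = rolled.mean()           # local average level
        df[f"{curve}_std_{w}"] = rolled.std()             # local variability
        df[f"{curve}_slope_{w}"] = df[curve].diff(w) / w  # coarse up/down trend
    return df
```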

Datasets

The default demo dataset is a collection of over 2000 wells made public by the Alberta Geological Survey (part of the Alberta Energy Regulator). To quote their webpage, "In 1986, Alberta Geological Survey began a project to map the McMurray Formation and the overlying Wabiskaw Member of the Clearwater Formation in the Athabasca Oil Sands Area. The data that accompany this report are one of the most significant products of the project and will hopefully facilitate future development of the oil sands." It includes well log curves as LAS files and tops in .txt and .xls files. A Word doc and a text file describe the files and associated metadata.

Wynne, D.A., Attalla, M., Berezniuk, T., Brulotte, M., Cotterill, D.K., Strobl, R. and Wightman, D. (1995): Athabasca Oil Sands data McMurray/Wabiskaw oil sands deposit - electronic data; Alberta Research Council, ARC/AGS Special Report 6.

Please go to the links below for more information and the dataset:

Report for Athabasca Oil Sands Data McMurray/Wabiskaw Oil Sands Deposit http://ags.aer.ca/document/OFR/OFR_1994_14.PDF

Electronic data for Athabasca Oil Sands Data McMurray/Wabiskaw Oil Sands Deposit http://ags.aer.ca/publications/SPE_006.html The data is also in the SPE_006_originalData folder of the original repo for this project, here.

In the metadata file SPE_006.txt, the dataset is described as having Access Constraints: Public, and Use Constraints: Credit to originator/source required. Commercial reproduction not allowed.

The latitude and longitude of the wells are not in the original dataset. @dalide used the Alberta Geological Survey's UWI conversion tool to find lat/longs for each of the well UWIs. A CSV with the coordinates of each well's location can be found here. These were then used to find each well's nearest neighbors.
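For readers who want to reproduce the neighbor lookup, a minimal sketch is below. It assumes a CSV named well_lat_lng.csv with UWI, lat, and lng columns (hypothetical names) and uses scikit-learn's BallTree with a haversine metric; the repository's own neighbor-finding code may differ:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

wells = pd.read_csv("well_lat_lng.csv")  # hypothetical filename and columns

# haversine expects (lat, lng) in radians; distances come back in radians too
coords = np.radians(wells[["lat", "lng"]].to_numpy())
tree = BallTree(coords, metric="haversine")

dist, idx = tree.query(coords, k=9)      # each well plus its 8 nearest neighbors
dist_km = dist * 6371.0                  # radians -> kilometers via Earth's radius
neighbor_uwis = wells["UWI"].to_numpy()[idx[:, 1:]]  # drop the self-match in column 0
```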

Please note that there are a few malformed .LAS files in the full dataset, so the code in this repository skips them.
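A minimal sketch of that skip-on-failure pattern, assuming the lasio library and the SPE_006_originalData folder layout (the repository's actual loader may differ):

```python
import glob
import lasio  # assumption: any LAS reader with a read() entry point fits the same pattern

curves_by_well = {}
skipped = []
for path in glob.glob("SPE_006_originalData/**/*.LAS", recursive=True):
    try:
        las = lasio.read(path)
        curves_by_well[path] = las.df()  # curves as a depth-indexed DataFrame
    except Exception as exc:             # malformed files raise assorted parse errors
        skipped.append((path, str(exc)))

print(f"loaded {len(curves_by_well)} wells, skipped {len(skipped)} malformed files")
```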

If for some reason the well data is not found at the links above, you should be able to find it here.

Architecture and Abstraction

Please refer to the Architecture and Abstraction section in the docs; it provides information on code architecture, tasks, and folder organization.

Getting Started

See the Usage and Installation sections of the docs.

Credits

There's a theme here. Check the docs.


Status

The root mean squared error for the Top McMurray surface is down to ~7 meters (with a handful of wells, ~8% depending on settings, identified as too difficult to predict and excluded).

Distribution of absolute error in the test portion of the dataset for the Top McMurray surface, in meters. The Y-axis is the number of picks in each bin; the X-axis is the distance the predicted pick is off from the human-generated pick.

[image: current_errors_TopMcMr_20190517]

The algorithm currently used is XGBoost.
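The repository's actual training lives in its train module; the sketch below only shows the general shape of an XGBoost regression scored with RMSE, with stand-in random data in place of the real per-well features and pick labels, and placeholder hyperparameters:

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in data; in the real pipeline these would be features and
# distance-to-pick labels assembled by the earlier steps.
rng = np.random.default_rng(0)
X = rng.random((2000, 20))
y = rng.random(2000) * 100.0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE: {rmse:.1f} m")
```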

predictatops's People

Contributors: bluetyson, dependabot[bot], justingosses

predictatops's Issues

missing / in path to CSV for picks

Describe the bug
When trying to clone the package into Colab and run all_runner.py, it seems to fail on importing the PICKS file because a / is missing in https://github.com/JustinGOSSES/predictatops/blob/master/predictatops/configurationplusfiles.py on the line

```
self.picks_dic = self.data_directory + "OilSandsDB/PICKS.TXT"
```

which should be (adding the missing /)

```
self.picks_dic = self.data_directory + "/OilSandsDB/PICKS.TXT"
```

To Reproduce
Clone into Colab and follow the normal instructions in the docs.

Expected behavior
The PICKS file import shouldn't fail.

Desktop (please complete the following information):
colab 2020-05-01

Reminder to fix this tomorrow on a different computer.

something in steps 1-3 stores the path to files instead of filenames; may have been an accidental change during documentation?

Describe the bug
A user reported via email a bug in the import script: the intersection of file names and well names fails because the full path, rather than just the filename, is used for the names of wells to import against.

To Reproduce
Steps to reproduce the behavior:

  1. Run notebook for first 3 steps
  2. Try to use import_runner.py on the results of notebook for first three steps.

Expected behavior
The comparison operation should find approximately 1200 well names that match well files.
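A guess at the kind of fix involved, with hypothetical variable names: reduce each path to its bare filename before intersecting with the well-name list.

```python
from pathlib import Path

# Hypothetical illustration: intersecting full paths against bare well
# names matches nothing, so strip paths down to filenames first.
paths_of_las_files = ["data/OilSandsDB/Logs/00-01-01-073-05W4-0.LAS"]
wells_with_picks = ["00-01-01-073-05W4-0"]

names_from_files = {Path(p).stem for p in paths_of_las_files}
matched = names_from_files & set(wells_with_picks)  # non-empty once paths are stripped
```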


Desktop (please complete the following information):
unknown

Additional context
If you emailed me on this issue, please add more context below! I'll try to get to this within the next week.

UMAP for visualization & feature creation step

Cluster wells using unsupervised learning and then see if clusters can be created that correlate with supervised prediction results. (Initial trials with UMAP give encouraging results.)
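A minimal sketch of that idea, assuming the umap-learn package and stand-in per-well features (none of this code ships in the repo):

```python
import numpy as np
import umap  # umap-learn; an assumption, since the repo has no UMAP code yet
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in for a (n_wells, n_features) matrix of per-well summary features
X = np.random.default_rng(42).random((300, 12))

X_scaled = StandardScaler().fit_transform(X)
embedding = umap.UMAP(n_components=2, n_neighbors=15, random_state=42).fit_transform(X_scaled)

# Cluster in the embedded space, then check whether cluster membership
# correlates with per-well error from the supervised model.
clusters = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(embedding)
```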

add example of mapping various attributes to plot.py

Add examples of mapping various attributes to plot.py; see the sketch after this list:

  • Error
  • Range of depth of top X number of depth predictions by probability
  • Actual top depth
  • Unit thickness
  • max or min statistic of certain curve type
  • availability of different curve lists
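One possible shape for such an example, with assumed column names ('lat', 'lng', and a per-well attribute like 'abs_error_m'); this is not the actual plot.py API:

```python
import matplotlib.pyplot as plt

def map_attribute(wells, attribute, cmap="viridis"):
    """Scatter-map wells by location, colored by any per-well attribute.
    `wells` is assumed to be a DataFrame with 'lat' and 'lng' columns."""
    fig, ax = plt.subplots(figsize=(6, 6))
    sc = ax.scatter(wells["lng"], wells["lat"], c=wells[attribute], cmap=cmap, s=12)
    fig.colorbar(sc, ax=ax, label=attribute)
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    ax.set_title(f"{attribute} by well location")
    return fig
```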

Add tests

Make small test datasets with 2 different organizations.

Need to investigate problem in features_runner.py

problem in features_runner.py:

```
len(df_test5) 1302634
/Users/justingosses/anaconda/envs/MannvilleDask2/lib/python3.6/site-packages/pandas/core/generic.py:1996: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->['UWI', 'trainOrTest', 'Neighbors_Obj', 'class_DistFrPick_TopTarget', 'class_DistFrPick_TopHelper', 'closerToBotOrTop']]

return pytables.to_hdf(path_or_buf, key, self, **kwargs)
```

create geojson around different 'regions' based on thickness, turn into one-hot vectors

Use thickness, and potentially a residual of depth or thickness, to identify sharp lines where sudden changes occur. Manually or programmatically split the map into contiguous "regions". Create a feature dimension for each region: every well is a 1 in one or more features and a 0 in the rest. Features can overlap, but it might be better if they don't? Not sure.
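A minimal sketch of the one-hot step, assuming shapely for point-in-polygon tests and a GeoJSON file whose features carry a 'name' property (all hypothetical):

```python
import json
from shapely.geometry import Point, shape  # assumption: shapely handles the geometry

def add_region_onehots(wells, geojson_path):
    """Add one 0/1 column per region polygon; overlapping regions are allowed,
    so a well can be 1 in more than one column. `wells` needs 'lat'/'lng'."""
    with open(geojson_path) as f:
        regions = json.load(f)["features"]
    points = [Point(xy) for xy in zip(wells["lng"], wells["lat"])]
    for i, feat in enumerate(regions):
        poly = shape(feat["geometry"])
        name = feat["properties"].get("name", f"region{i}")
        wells[f"in_{name}"] = [int(poly.contains(pt)) for pt in points]
    return wells
```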

need to change package name to match Python standards

Suggested names:
wellpickmimic
stratpicksupml
pickmimic
topmimic
stratpick
supervisedstratigraphy
stratipredict
strataterpretor
stratalmimic
strataxtension
pickpredict
predictatops <- gets a cool logo

Supervised Iterative Machine-Learning for Stratigraphic Interpretation Mimicry = simsim
Well-based Iterative Stratigraphic Supervised Machine-Learning = wissml
supervisedstratigraphy
supervisedstrat
superstrat - an approach for supervised machine-learning prediction of chronostratigraphic well tops.

clean & add to documentation

  • How to install
  • put on pypi? or wait on that until further along probably.
  • Provide a couple "how to use" examples.
  • Provide link to demo notebooks.
  • Add all_runner.py description and examples.
  • Shrink to only 1 README; finish merging markdown into RST format and push more into the docs instead of the README.
  • Make all the functions easier to navigate through.
  • Provide a page with the highest level functions only.

to ensure easier use with data that comes in different formats, move the merge of all input data to the first bits of work.

Currently, things like finding all available curves in wells and finding neighboring wells are done with just the input files needed for those tasks, and the results are then merged.

If someone is bringing in geographic coordinate data that isn't in txt or CSV form, it might be easier to adapt this code if they only had to write their own imports and transforms at the very beginning.

Some of the code having to do with SiteID <=> UWI <=> well log filenames, for example, is transformed a couple of different times in different early modules.

In summary, it would probably make more sense to do this once and carry a single big dataframe from then on. The intermediate file sizes would be bigger, but the code would be less complex to adapt if everything were a dataframe from the check/load step onwards.
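A sketch of what that single early merge could look like, with assumed file layouts and column names (the real PICKS.TXT schema may differ):

```python
import glob
import pandas as pd

# Resolve UWI <=> filename once, then carry one DataFrame through the pipeline.
picks = pd.read_csv("OilSandsDB/PICKS.TXT", sep="\t")   # column names assumed
coords = pd.read_csv("well_lat_lng.csv")                # lat/lng keyed on UWI, hypothetical

files = pd.DataFrame({"las_path": glob.glob("OilSandsDB/Logs/*.LAS")})
files["UWI"] = files["las_path"].str.extract(r"([^/\\]+)\.LAS$", expand=False)

wells = picks.merge(coords, on="UWI", how="inner").merge(files, on="UWI", how="inner")
```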

Change fetch_demo_data.py to load from a zip file of all files instead of individual files! Then the unzipped dataset doesn't have to be saved.

Is your feature request related to a problem? Please describe.
Data loads slowly and has to be stored uncompressed because fetch_demo_data.py loads files individually instead of loading a zip file and unzipping it into place.

Describe the solution you'd like
All demo data in a single zip file, and everything else works as normal.

Describe alternatives you've considered
Leave as is; take away the zip file since it's a duplicate.
Alternatively, put it in another location and load from there?
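A minimal sketch of the zip-based fetch using only the standard library; the URL is a placeholder, not a real download location:

```python
import io
import zipfile
from urllib.request import urlopen

DEMO_ZIP_URL = "https://example.com/predictatops_demo_data.zip"  # hypothetical URL

# Download the demo dataset as one archive and unpack it in place, instead of
# fetching files one at a time and keeping an uncompressed copy around.
with urlopen(DEMO_ZIP_URL) as resp:
    with zipfile.ZipFile(io.BytesIO(resp.read())) as zf:
        zf.extractall("demo_data")
```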

