Theory aware Machine Learning (TaML)

This repository supports the following manuscript

Debra J. Audus, Austin McDannald, and Brian DeCost, "Leveraging Theory for Enhanced Machine Learning" ACS Macro Letters 2022 11 (9), 1117-1122 DOI: 10.1021/acsmacrolett.2c00369,

which explores methods for incorporating imperfect theory into machine learning for improved prediction and explainability. Specifically, it focuses on the case study of the dimensions of a polymer chain, in this case the radius of gyration, in different solvent qualities. For machine learning models, three models are considered: Gaussian Process Regression with heteroscedastic noise, Gaussian Process Regression with homoscedastic noise and Random Forest. Of the three models, we encourage use of Gaussian Process Regression with heteroscedastic noise as it provides accurate uncertainty estimates.
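As a rough illustration of what incorporating imperfect theory can look like, the sketch below fits a model to the log-ratio between data and a theoretical estimate, then multiplies the theory by the learned correction. This is only one possible strategy and is not claimed to be one of the methods compared in the manuscript. The function names, the use of the ideal-chain estimate Rg = b * sqrt(N / 6) as the "theory", the plain least-squares line standing in for a machine learning model, and the synthetic power-law data are all assumptions made for this example.

```python
import math


def theory_rg(n, b=1.0):
    """Ideal-chain estimate of the radius of gyration, Rg = b * sqrt(N / 6).

    Used here as the deliberately imperfect theory (assumption for this sketch).
    """
    return b * math.sqrt(n / 6.0)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_correction(ns, rgs):
    """Learn log(Rg / Rg_theory) as a linear function of log(N)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(rg / theory_rg(n)) for n, rg in zip(ns, rgs)]
    return fit_line(xs, ys)


def predict_rg(n, slope, intercept):
    """Theory prediction multiplied by the learned correction factor."""
    return theory_rg(n) * math.exp(slope * math.log(n) + intercept)


# Synthetic, noise-free data with a good-solvent-like exponent (illustration only).
chain_lengths = [50, 100, 200, 400, 800]
measured_rg = [0.4 * n ** 0.588 for n in chain_lengths]

slope, intercept = fit_correction(chain_lengths, measured_rg)
print("learned exponent correction:", slope)
print("corrected prediction at N=1600:", predict_rg(1600, slope, intercept))
```

Because the model only has to learn how the data deviate from the theory, the quantity being fitted is simpler than the raw data, and the learned slope can be read directly as a correction to the theoretical exponent.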

Gaussian Process Regression with heteroscedastic noise relies on the GPFlow Python package. Because GPFlow does not natively support heteroscedastic noise, we implement a derived class that adds this functionality (see taml/GPRhetero.py). Gaussian Process Regression with homoscedastic noise uses GPFlow's native implementation. Random Forest is implemented using Scikit-learn.
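To make the distinction between the two noise models concrete, here is a minimal NumPy sketch of Gaussian Process Regression in which each training point carries its own known noise variance. It is not the implementation in taml/GPRhetero.py and does not use the GPFlow API. The function names rbf_kernel and gp_predict_hetero are hypothetical, the kernel hyperparameters are fixed rather than optimized, and the data are synthetic.

```python
import numpy as np


def rbf_kernel(xa, xb, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    sqdist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)


def gp_predict_hetero(x_train, y_train, noise_var, x_test,
                      variance=1.0, lengthscale=1.0):
    """Posterior mean and variance of the latent function.

    noise_var holds one noise variance per training point (heteroscedastic).
    A homoscedastic model would instead add a single scalar to the diagonal.
    """
    k_train = rbf_kernel(x_train, x_train, variance, lengthscale)
    k_noisy = k_train + np.diag(noise_var)

    # Cholesky factorization for a numerically stable solve.
    chol = np.linalg.cholesky(k_noisy)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))

    k_cross = rbf_kernel(x_train, x_test, variance, lengthscale)
    mean = k_cross.T @ alpha

    v = np.linalg.solve(chol, k_cross)
    var = variance - np.sum(v**2, axis=0)
    return mean, var


# Synthetic example: the third observation is far noisier than the others.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(x)
noise_var = np.array([1e-4, 1e-4, 1.0, 1e-4])

mean, var = gp_predict_hetero(x, y, noise_var, x)
print("posterior mean:", mean)
print("posterior variance:", var)
```

The only change relative to the homoscedastic case is the diagonal term added to the kernel matrix. Points with large known uncertainty pull the posterior less strongly, and the predicted variance near them stays larger.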

The repository is intended for the following use cases:

  • Illustrate key ideas from the manuscript, including incorporating theory and using Gaussian Process Regression with heteroscedastic noise (see notebooks/MethodComparison_GPR_HeteroscedasticNoise and the companion notebook without heteroscedastic noise, notebooks/MethodComparison_GPR_HomoscedasticNoise).
  • Provide code for Gaussian Process Regression with heteroscedastic noise (which can be used after installation with from taml.GPRhetero import GPRhetero).
  • Reproduce figures from our manuscript (see the notebooks folder).
  • Allow for full reproducibility of the data in the manuscript.

Running the code

All code is written in Python and requires Python >= 3.7. It can be used on any operating system. Other requirements are listed in requirements.txt.
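A quick way to confirm that the interpreter meets the stated minimum is sketched below. The helper name python_is_supported is made up for this example and is not part of the TaML package.

```python
import sys

MINIMUM = (3, 7)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print("Python version is sufficient:", sys.version.split()[0])
else:
    print("Python 3.7 or later is required; found", sys.version.split()[0])
```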

If you are only interested in running the Jupyter Notebooks in Google Colab, you can skip ahead to Notebooks.

First clone the code via

git clone https://github.com/usnistgov/TaML.git

and navigate to the directory where the repository lives

cd TaML

Next, one needs to create a virtual environment. This can be done using Python virtual environments or with Anaconda. Both options are listed below.

Create a Python virtual environment (option 1)

First, make sure you are using Python 3.7 or later.

python3 -m venv env

where env is the location of the virtual environment

Activate the virtual environment

source env/bin/activate

Install dependencies

python3 -m pip install -r requirements.txt

Create a virtual environment with Anaconda (option 2)

First, install conda.

conda env create -f environment.yml

If you are using conda>=4.6, activate the virtual environment via

conda activate TaML

Otherwise, see the conda docs

GPFlow 2.2.1 is not available on conda channels and must be installed via pip

pip install gpflow==2.2.1

Install the TaML package

For users who wish to use the source code or import functions, the TaML package can be installed via

pip install .

Notebooks

The following notebooks are included:

  • DataVisualization: visualizes the input data used for machine learning.
  • MethodComparison_GPR_HeteroscedasticNoise: compares different methods for incorporating theory into machine learning using Gaussian Process Regression with heteroscedastic noise.
  • MethodComparison_GPR_HomoscedasticNoise: compares different methods for incorporating theory into machine learning using Gaussian Process Regression with homoscedastic noise.
  • ViewResults: plots the relative performance of different methods for incorporating theory into machine learning for three different machine learning models.

Running notebooks locally (option 1)

For users interested in testing ideas, we recommend focusing on the MethodComparison_GPR_HeteroscedasticNoise notebook as it explores the different methods and takes into account the known uncertainties in the input data.

If you cloned the repository, the Jupyter notebooks can be run by navigating to the notebooks folder and using the command

jupyter notebook

Running notebooks in Google Colab (option 2)

If you are interested in running one or more notebooks in Google Colab, first click on the relevant link below. These links were generated by navigating to the notebook of interest on the TaML GitHub page (for example, https://github.com/usnistgov/TaML/blob/main/notebooks/MethodComparison_GPR_HeteroscedasticNoise.ipynb) and then replacing github.com with githubtocolab.com in the URL.
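The URL substitution described above can be scripted as follows. The function name to_colab_url is hypothetical. It only rewrites the host of a GitHub URL and does not check that the notebook exists.

```python
GITHUB_PREFIX = "https://github.com/"
COLAB_PREFIX = "https://githubtocolab.com/"


def to_colab_url(github_url):
    """Swap the github.com host for githubtocolab.com, keeping the path."""
    if not github_url.startswith(GITHUB_PREFIX):
        raise ValueError("expected a URL starting with " + GITHUB_PREFIX)
    return COLAB_PREFIX + github_url[len(GITHUB_PREFIX):]


notebook_url = (
    "https://github.com/usnistgov/TaML/blob/main/notebooks/"
    "MethodComparison_GPR_HeteroscedasticNoise.ipynb"
)
colab_url = to_colab_url(notebook_url)
print(colab_url)
```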

This should open the notebook in Google Colab. For the DataVisualization and ViewResults notebooks, all dependencies are likely available and you should be able to directly run them. For the MethodComparison_GPR_HeteroscedasticNoise and MethodComparison_GPR_HomoscedasticNoise notebooks, you must install GPFlow. This can be accomplished by

(1) uncommenting the code block

!pip install gpflow==2.2.1

(2) executing the code block

(3) restarting the run time environment (there should be a button at the bottom of the output for that code block).

Then you can run the notebook as normal.
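After restarting the runtime, it can be useful to confirm that the expected GPFlow version is the one in use. The sketch below assumes Python 3.8 or later, where importlib.metadata is part of the standard library. The helper name installed_version is hypothetical.

```python
from importlib import metadata  # standard library in Python 3.8+

REQUIRED_GPFLOW = "2.2.1"  # version pinned in this README


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("gpflow")
if found is None:
    print("gpflow is not installed; run the pip install cell first.")
elif found != REQUIRED_GPFLOW:
    print("gpflow", found, "is installed, but", REQUIRED_GPFLOW, "is expected.")
else:
    print("gpflow", found, "is installed.")
```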

Source code

The source code (see the taml folder) compares a variety of methods for incorporating theory into machine learning for three different machine learning models: Gaussian Process Regression with heteroscedastic noise, Gaussian Process Regression with homoscedastic noise, and Random Forest. The output files can be plotted by modifying the notebook titled ViewResults so that the data files are pulled from a local run instead of the stored data.

To run the source code

python3 -m taml

Contact

Debra J. Audus, PhD
Polymer Analytics Project
Materials Science and Engineering Division
Material Measurement Laboratory
National Institute of Standards and Technology

Email: [email protected]
GithubID: @debraaudus
Project website: https://www.nist.gov/programs-projects/polymer-analytics
Staff website: https://www.nist.gov/people/debra-audus

How to cite

If you use the code, please cite our manuscript:

Debra J. Audus, Austin McDannald, and Brian DeCost, "Leveraging Theory for Enhanced Machine Learning" ACS Macro Letters 2022 11 (9), 1117-1122 DOI: 10.1021/acsmacrolett.2c00369

If you use the data, please cite:

Audus, Debra, McDannald, Austin, DeCost, Brian (2022), Theory aware Machine Learning (TaML), National Institute of Standards and Technology, https://doi.org/10.18434/mds2-2637 (Accessed YYYY-MM-DD)
