aviary's Introduction

Aviary

License: MIT. This project supports Python 3.9+.

The aim of aviary is to collect multiple models for materials discovery under a common interface. Over time we hope to add more models, with a particular focus on coordinate-free deep learning models.

Installation

Aviary requires torch-scatter. pip install it with

pip install torch-scatter -f https://data.pyg.org/whl/torch-2.2.1+cpu.html

Make sure you replace 2.2.1 with your actual torch.__version__ (python -c 'import torch; print(torch.__version__)') and cpu with your CUDA version if applicable.
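
As a rough sketch of how the wheel index URL above is put together, the helper below builds it from a torch version string and a compute tag. The function name `pyg_wheel_index` is hypothetical and is not part of aviary or torch-scatter; it only mirrors the URL pattern shown in the install command.

```python
def pyg_wheel_index(torch_version: str, compute: str = "cpu") -> str:
    """Build the torch-scatter wheel index URL for a given torch version.

    `torch_version` may carry a local suffix such as "2.2.1+cu121"; the suffix
    is dropped in favour of the explicit `compute` tag (e.g. "cpu", "cu121").
    """
    base_version = torch_version.split("+", 1)[0]
    return f"https://data.pyg.org/whl/torch-{base_version}+{compute}.html"


if __name__ == "__main__":
    # Reproduces the URL used in the install command above.
    print(pyg_wheel_index("2.2.1", "cpu"))
```

In practice you would pass the string printed by `python -c 'import torch; print(torch.__version__)'` as the first argument.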

Then install aviary from source with

pip install -U git+https://github.com/CompRhys/aviary

or for an editable source install from a local clone:

git clone https://github.com/CompRhys/aviary
pip install -e ./aviary

Example Use from CLI

To test the input file generation and cleaning/canonicalization, please run:

python examples/inputs/poscar_to_df.py

This script will load and parse a subset of raw POSCAR files from the TAATA dataset and produce the datasets/examples/examples.csv and datasets/examples/examples.json files used for the next example. For the coordinate-free roost and wren models where the inputs are easily expressed as strings we use CSV inputs. For the structure-based cgcnn model we first construct pymatgen structures from the raw POSCAR files then determine their dictionary serializations before saving in a JSON format. The raw POSCAR files have been selected to ensure that the subset contains all the correct endpoints for the 5 elemental species in the Hf-N-Ti-Zr-Zn chemical system. To test each of the three models provided please run:

python examples/roost-example.py --train --evaluate --data-path examples/inputs/examples.csv --targets E_f --tasks regression --losses L1 --robust --epoch 10
python examples/wren-example.py --train --evaluate --data-path examples/inputs/examples.csv --targets E_f --tasks regression --losses L1 --robust --epoch 10
python examples/cgcnn-example.py --train --evaluate --data-path examples/inputs/examples.json --targets E_f --tasks regression --losses L1 --robust --epoch 10

Please note that for speed and demonstration purposes these examples run on only ~68 materials for 10 epochs; running all of them should take < 30 sec. The examples do not have sufficient data or training to make accurate predictions. However, the same scripts were used for all experiments conducted as part of the development and publication of these models. Consequently, understanding these examples will ensure you can deploy the models as intended for your research.

Notebooks

We also provide some notebooks that show a more pythonic way to interact with the codebase. These examples make use of the TAATA dataset examined in the wren manuscript:

Roost Launch Codespace Open in Google Colab Launch Binder
Wren Launch Codespace Open in Google Colab Launch Binder

Cite This Work

If you use this code please cite the relevant work:

roost - Predicting materials properties without crystal structure: Deep representation learning from stoichiometry. [Paper] [arXiv]

@article{goodall_2020_predicting,
  title={Predicting materials properties without crystal structure: Deep representation learning from stoichiometry},
  author={Goodall, Rhys EA and Lee, Alpha A},
  journal={Nature Communications},
  volume={11},
  number={1},
  pages={1--9},
  year={2020},
  publisher={Nature Publishing Group}
}

wren - Rapid Discovery of Stable Materials by Coordinate-free Coarse Graining. [Paper] [arXiv]

@article{goodall_2022_rapid,
  title={Rapid discovery of stable materials by coordinate-free coarse graining},
  author={Goodall, Rhys EA and Parackal, Abhijith S and Faber, Felix A and Armiento, Rickard and Lee, Alpha A},
  journal={Science Advances},
  volume={8},
  number={30},
  pages={eabn4117},
  year={2022},
  publisher={American Association for the Advancement of Science}
}

cgcnn - Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. [Paper] [arXiv]

@article{xie_2018_crystal,
  title={Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties},
  author={Xie, Tian and Grossman, Jeffrey C},
  journal={Physical review letters},
  volume={120},
  number={14},
  pages={145301},
  year={2018},
  publisher={APS}
}

Disclaimer

This research code is provided as-is. We have checked for potential bugs and believe that the code is being shared in a bug-free state.

aviary's People

Contributors

comprhys, deepsource-autofix[bot], janosh, pre-commit-ci[bot]

aviary's Issues

test_regression_metrics + test_classification_metrics

We may want to port these tests for ml_matrics/metrics.py to aviary:

import numpy as np
import pandas as pd

from ml_matrics import classification_metrics, regression_metrics

y_binary, y_proba, y_clf = pd.read_csv("data/rand_clf.csv").to_numpy().T
xs, y_pred, y_true = pd.read_csv("data/rand_regr.csv").to_numpy().T

def test_regression_metrics():
    metrics = regression_metrics(y_true, y_pred, verbose=False)
    assert metrics["mae"]
    assert metrics["rmse"]
    assert metrics["r2"]


def test_regression_metrics_ensemble():
    # simulate 2-model ensemble by duplicating predictions along 0-axis
    y_preds = np.tile(y_pred, (2, 1))
    metrics = regression_metrics(y_true, y_preds, verbose=False)
    assert metrics["single"]
    assert metrics["ensemble"]


def test_classification_metrics():
    y_probs = np.expand_dims(y_proba, axis=(0, 2))
    metrics = classification_metrics(y_binary, y_probs, verbose=False)
    assert metrics["acc"]
    assert metrics["roc_auc"]
    assert metrics["precision"]
    assert metrics["recall"]
    assert metrics["f1"]


def test_classification_metrics_ensemble():
    y_probs = np.expand_dims(y_proba, axis=(0, 2))
    y_probs = np.tile(y_probs, (2, 1, 1))
    metrics = classification_metrics(y_binary, y_probs, verbose=False)
    assert metrics["single"]
    assert metrics["ensemble"]

Add models that are equivalent to Roost

CrabNet and AtomSets-v0 are both equivalent to roost in that they are weighted set regression architectures. If aviary is to develop into a DeepChem for inorganic materials property prediction it might be nice to add implementations of these models.

kwarg for `verbose`

Would be nice to specify whether or not to print out info, e.g. via verbose=False or verbose=True
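
A minimal sketch of what such a keyword argument could look like, using a made-up helper rather than any existing aviary function. The name `summarize_targets` and its return keys are assumptions for illustration only.

```python
def summarize_targets(targets: list[float], verbose: bool = True) -> dict[str, float]:
    """Summarize target values, printing the summary only if `verbose` is True."""
    if not targets:
        raise ValueError("targets must not be empty")
    summary = {"n": len(targets), "mean": sum(targets) / len(targets)}
    if verbose:
        # All console output is gated behind the flag.
        print(f"n={summary['n']}, mean={summary['mean']:.3f}")
    return summary


if __name__ == "__main__":
    summarize_targets([1.0, 3.0])                 # prints the summary
    summarize_targets([1.0, 3.0], verbose=False)  # stays silent
```

The return value is the same either way, so callers can switch off output without changing behaviour.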

Automatically search for AFLOW executable

Currently we hard code the AFLOW path AFLOW_EXECUTABLE = "~/bin/aflow" but it would be better to make use of shutil.which("aflow") as this is less likely to break and we can return an error if it's not found.
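
One way this lookup could be written is sketched below. The helper name `find_executable` is hypothetical, and keeping the old hard-coded path as a fallback is an assumption rather than something the issue asks for.

```python
import shutil
from pathlib import Path


def find_executable(name: str = "aflow", fallback: str = "~/bin/aflow") -> str:
    """Return the path to `name`, preferring whatever is found on PATH."""
    found = shutil.which(name)
    if found is not None:
        return found
    # Fall back to the previously hard-coded location, if the file exists.
    candidate = Path(fallback).expanduser()
    if candidate.is_file():
        return str(candidate)
    raise FileNotFoundError(f"Could not find {name!r} on PATH or at {fallback!r}")
```

Raising `FileNotFoundError` with both searched locations in the message gives users a clear hint about what to install or where to put it.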

Refactor `aviary/utils.py`

aviary/utils.py is definitely in need of an overhaul. Was quite hard to type it in #31 and flake8 complained about surpassing max-complexity, both of which are bad signs for API design.

Readme disclaimer correct?

This appears to be outdated?

Disclaimer

[...] As this is an archive version we will not be able to amend the code to fix bugs/edge-cases found at a later date. However, this code will likely continue to be developed at the location described in the metadata.

aviary/README.md

Lines 117 to 119 in 47462b9

## Disclaimer
This research code is provided as-is. We have checked for potential bugs and believe that the code is being shared in a bug-free state. As this is an archive version we will not be able to amend the code to fix bugs/edge-cases found at a later date. However, this code will likely continue to be developed at the location described in the metadata.

Instructions for use with custom datasets

Hi @CompRhys, curious if you could give some tips on using Roost with a custom dataset. In my case, I have the chemical formulas as a list of str and the target properties, already separated into train+val vs. test datasets. I'm looking through the Colab notebook to get things set up.

separate `fit` and `predict`

Thanks for the patience with all the posts.

It seems that the train and test data is passed in all at once. Ideally, I'd like to use RooSt in an sklearn-esque "instantiate, fit, and predict" style; it's not urgent, timescale is about a month. Since I'm not familiar with the underlying code, thought I would ask before diving in. Any thoughts/suggestions on this?

TypeError: 'NoneType' object is not iterable

I installed aviary using conda based on the instructions. However, when I run the command python examples/inputs/poscar2df.py, I met the following error:

Traceback (most recent call last):
  File "examples/inputs/poscar2df.py", line 7, in <module>
    from pymatgen.core import Composition, Structure
  File "/(home path)/.conda/envs/aviary/lib/python3.7/site-packages/pymatgen/core/__init__.py", line 62, in <module>
    SETTINGS = _load_pmg_settings()
  File "/(home path)/.conda/envs/aviary/lib/python3.7/site-packages/pymatgen/core/__init__.py", line 52, in _load_pmg_settings
    d.update(d_yml)
TypeError: 'NoneType' object is not iterable

Any idea on how to solve this?

Roost Colab default Cuda version issue

Tried running the Roost example Colab and got an error that is probably related to Colab now using CUDA 11.2.

OSError: libcudart.so.10.2: cannot open shared object file: No such file or directory
stack trace
OSError                                   Traceback (most recent call last)
<ipython-input-10-fd45f7ae93a3> in <module>()
      1 from aviary.roost.data import CompositionData, collate_batch as roost_cb
----> 2 from aviary.roost.model import Roost
      3 
      4 torch.manual_seed(0)  # ensure reproducible results
      5 

4 frames
/usr/local/lib/python3.7/dist-packages/aviary/roost/model.py in <module>()
      4 
      5 from aviary.core import BaseModelClass
----> 6 from aviary.segments import (
      7     MessageLayer,
      8     ResidualNetwork,

/usr/local/lib/python3.7/dist-packages/aviary/segments.py in <module>()
      1 import torch
      2 import torch.nn as nn
----> 3 from torch_scatter import scatter_add, scatter_max, scatter_mean
      4 
      5 

/usr/local/lib/python3.7/dist-packages/torch_scatter/__init__.py in <module>()
     14     spec = cuda_spec or cpu_spec
     15     if spec is not None:
---> 16         torch.ops.load_library(spec.origin)
     17     elif os.getenv('BUILD_DOCS', '0') != '1':  # pragma: no cover
     18         raise ImportError(f"Could not find module '{library}_cpu' in "

/usr/local/lib/python3.7/dist-packages/torch/_ops.py in load_library(self, path)
    108             # static (global) initialization code in order to register custom
    109             # operators with the JIT.
--> 110             ctypes.CDLL(path)
    111         self.loaded_libraries.add(path)
    112 

/usr/lib/python3.7/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
    362 
    363         if handle is None:
--> 364             self._handle = _dlopen(self._name, mode)
    365         else:
    366             self._handle = handle

OSError: libcudart.so.10.2: cannot open shared object file: No such file or directory

How to predict on new materials with saved pytorch file

I used roost-example.py and saved the trained model in a pytorch file (e.g., roost.pt). I have tried to load this file and predict as follows:

import pandas as pd
import torch

from aviary.roost.data import CompositionData

targets = ["E_f"]
tasks = ["regression"]
task_dict = dict(zip(targets, tasks))
df = pd.read_csv("candidate_compositions.csv")
X = CompositionData(df, elem_embedding="matscholar200", task_dict=task_dict)

model = torch.load("models/roost.pt")
y_pred = model.predict(X)

and I get the following output:

Traceback (most recent call last):
  File "~/opt/anaconda3/envs/aviary/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'E_f'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "roost-predict.py", line 12, in <module>
    y_pred = model.predict(X)
  File "~/opt/anaconda3/envs/aviary/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "~/opt/anaconda3/envs/aviary/lib/python3.7/site-packages/aviary/core.py", line 357, in predict
    data_loader, disable=True if not verbose else None
  File "~/opt/anaconda3/envs/aviary/lib/python3.7/site-packages/tqdm/std.py", line 1173, in __iter__
    for obj in iterable:
  File "~/opt/anaconda3/envs/aviary/lib/python3.7/site-packages/aviary/roost/data.py", line 126, in __getitem__
    targets.append(Tensor([row[target]]))
  File "~/opt/anaconda3/envs/aviary/lib/python3.7/site-packages/pandas/core/series.py", line 942, in __getitem__
    return self._get_value(key)
  File "~/opt/anaconda3/envs/aviary/lib/python3.7/site-packages/pandas/core/series.py", line 1051, in _get_value
    loc = self.index.get_loc(label)
  File "~/opt/anaconda3/envs/aviary/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'E_f'

Is it possible to add an example script to perform a prediction from a saved model?

Thank you
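
The second traceback shows `row[target]` raising the `KeyError` inside the dataset's `__getitem__`, which suggests that the candidate CSV has no `E_f` column. One possible workaround, sketched here under that assumption, is to add a placeholder column for every missing target before building the dataset. The helper name `add_placeholder_targets` and the `composition` column in the usage example are hypothetical, and whether `predict` then behaves as desired has not been verified here.

```python
import pandas as pd


def add_placeholder_targets(
    df: pd.DataFrame, targets: list[str], fill_value: float = 0.0
) -> pd.DataFrame:
    """Return a copy of `df` with a dummy column for each missing target."""
    out = df.copy()
    for target in targets:
        if target not in out.columns:
            # Placeholder only: these values are not real labels.
            out[target] = fill_value
    return out


if __name__ == "__main__":
    # "composition" is an assumed column name, used purely for illustration.
    candidates = pd.DataFrame({"composition": ["Fe2O3", "TiO2"]})
    print(add_placeholder_targets(candidates, ["E_f"]).columns.tolist())
```

Any metric computed against the placeholder values would be meaningless, so only the predictions themselves should be used.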

Is there matbench benchmark result for Wrenformer?

I saw in the commit history that you have conducted some experiments on the matbench benchmark. It's a very good idea and model, but I may not have enough computational resources to run it. I would like to know if you have final results?

Git Surgery Plan

While developing this code I have at several points been sloppy about committing large files to the git history. If we would like others to contribute, we would also like the history to show a more accurate representation of their contribution in terms of relative LOC. Consequently we're going to carry out some git surgery before our first official release.

The following is useful to identify large files in the git history:

git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | sed -n 's/^blob //p' \
  | sort --numeric-sort --key=2 \
  | cut -c 1-12,41- \
  | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

The following are some of the proposed clean-up commands.

git filter-branch --force --index-filter "git rm -r --cached --ignore-unmatch data/" --prune-empty --tag-name-filter cat -- --all
git filter-branch --force --index-filter "git rm -r --cached --ignore-unmatch *.pth.tar" --prune-empty --tag-name-filter cat -- --all
git filter-branch --force --index-filter "git rm -r --cached --ignore-unmatch notebooks/" --prune-empty --tag-name-filter cat -- --all
git filter-branch --force --index-filter "git rm -r --cached --ignore-unmatch examples/colab/" --prune-empty --tag-name-filter cat -- --all
git filter-branch --force --index-filter "git rm -r --cached --ignore-unmatch results/" --prune-empty --tag-name-filter cat -- --all
git filter-branch --force --index-filter "git rm -r --cached --ignore-unmatch examples/plots/" --prune-empty --tag-name-filter cat -- --all

Colab example notebooks will be re-added once we have ensured that their output is cleaned.

How to extract learned representation for transfer learning

Is there a way to extract the Roost learned representation for a certain task (e.g., formation energy prediction) and use it for another task (e.g., cohesive energy prediction)? I have looked at the Roost example and also tried converting the tensor from aviary.roost.model.DescriptorNetwork.forward to a DataFrame, but the size doesn't match my training set size.

`Roost.forward()` and `Wren.forward()` return type tuple or generator?

Going through the codebase to add type hints reminded me that I meant to ask if the return type Generator[Tensor, None, None] for Roost.forward() and Wren.forward() is intended?

def forward(self, elem_weights, elem_fea, self_fea_idx, nbr_fea_idx, cry_elem_idx):
    """
    Forward pass through the material_nn and output_nn
    """
    crys_fea = self.material_nn(
        elem_weights, elem_fea, self_fea_idx, nbr_fea_idx, cry_elem_idx
    )
    crys_fea = F.relu(self.trunk_nn(crys_fea))
    # apply neural network to map from learned features to target
    return (output_nn(crys_fea) for output_nn in self.output_nns)

If the type should be tuple[Tensor] instead, we'd need to change to

return tuple(output_nn(crys_fea) for output_nn in self.output_nns) 
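
The practical difference between the two return types can be seen without any PyTorch code. The sketch below uses plain callables as stand-ins for the entries of `self.output_nns`; the function names are hypothetical.

```python
def heads_as_generator(feature, heads):
    """Mimic the current return: a lazy generator over head outputs."""
    return (head(feature) for head in heads)


def heads_as_tuple(feature, heads):
    """Mimic the proposed return: an eagerly evaluated tuple."""
    return tuple(head(feature) for head in heads)


# Two toy "output heads" standing in for neural network modules.
heads = [lambda x: x + 1, lambda x: x * 2]

gen_out = heads_as_generator(3, heads)
first_pass = list(gen_out)   # [4, 6]
second_pass = list(gen_out)  # [] because the generator is exhausted

tup_out = heads_as_tuple(3, heads)
# A tuple supports len(), indexing and repeated iteration.
assert len(tup_out) == 2 and tup_out[0] == 4
```

A generator can only be consumed once and has no length, so callers that index into the outputs or iterate twice would need the tuple form.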

`train_ensemble` and `results_multitask` only accept `torch.utils.data.Subset`

train_ensemble() and results_multitask() are too opinionated in that they don't accept a torch.utils.data.Dataset. The input has to be a torch.utils.data.Subset, which you don't get if you handle train/test splitting yourself outside of PyTorch.

In particular, you can't feed in CompositionData or WyckoffData. Throws

    357                 sample_target = Tensor(
--> 358                     train_set.dataset.df[target].iloc[train_set.indices].values
    359                 )
AttributeError: 'CompositionData' object has no attribute 'dataset'
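
The traceback shows the code reaching through `train_set.dataset` and `train_set.indices`, the two attributes that `torch.utils.data.Subset` exposes. One possible way to accept both kinds of input is to unwrap a subset when one is passed and otherwise treat the object as a complete dataset. The sketch below uses small stand-in classes instead of PyTorch, and the helper name `resolve_dataset` is hypothetical.

```python
class ToyDataset:
    """Stand-in for a full dataset such as CompositionData."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)


class ToySubset:
    """Stand-in mirroring the `dataset` and `indices` attributes of Subset."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)


def resolve_dataset(data):
    """Return (underlying dataset, selected indices) for either input kind."""
    if hasattr(data, "dataset") and hasattr(data, "indices"):
        return data.dataset, list(data.indices)
    # A plain dataset selects all of its own rows.
    return data, list(range(len(data)))
```

With such a helper, the target lookup in the traceback could index the underlying DataFrame with the returned indices in both cases.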

Make pip installable

Currently pip install git+https://github.com/CompRhys/aviary succeeds but

from aviary.roost import data

fails with

ModuleNotFoundError: No module named 'aviary.roost'

Wren: Why does averaging of augmented Wyckoff positions happen inside the NN, after message passing?

https://www.science.org/doi/epdf/10.1126/sciadv.abn4117

The categorization of Wyckoff positions depends on a choice of origin (50). Hence, there is not a unique mapping between the crystal structure and the Wyckoff representation. To ensure that the model is invariant to the choice of origin, we perform on-the-fly augmentation of Wyckoff positions with respect to this choice of origin (see Fig. 6). The augmented representations are averaged at the end of the message passing stage to provide a single representation of equivalent Wyckoff representations to the output network. By pooling at this point, we ensure that the model is invariant and that its training is not biased toward materials for which many equivalent Wyckoff representations exist.

Probably a noob question here. I think I understand that it needs to happen at some point, but why does it need to happen after message passing? Why not implement this at the very beginning (i.e. in the input data representation)? Not so much doubtful of the choice as I am interested in the mechanics behind this choice. A topic that's come up in another context for me.
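
To make the pooling step from the quoted passage concrete, here is a toy sketch of averaging the embeddings of equivalent representations into a single vector. It uses plain Python lists in place of tensors, the function name `pool_equivalent` is hypothetical, and it does not reproduce Wren's actual implementation.

```python
def pool_equivalent(embeddings):
    """Average a list of equal-length embedding vectors element-wise."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


# Two equivalent "views" of the same material, as toy 2-d embeddings.
two_views = [[1.0, 2.0], [3.0, 4.0]]

pooled = pool_equivalent(two_views)           # [2.0, 3.0]
reordered = pool_equivalent(two_views[::-1])  # same result, order is irrelevant
```

However many equivalent views a material has, the output network sees exactly one pooled vector for it, and that vector does not depend on the order in which the views were generated.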
