
gpr-hat-barrier-prediction

Code repository for

Evgeni Ulanov¹, Ghulam A. Qadir¹, Kai Riedmiller¹, Pascal Friederich²˒³, Frauke Gräter¹˒⁴


This repository is a collection of Python scripts used in the paper "Predicting hydrogen atom transfer energy barriers using Gaussian process regression".

Abstract

Predicting reaction barriers for arbitrary configurations based on only a limited set of density functional theory (DFT) calculations would render the design of catalysts or the simulation of reactions within complex materials highly efficient. We here propose Gaussian process regression (GPR) as a method of choice if DFT calculations are limited to hundreds or thousands of barrier calculations. For the case of hydrogen atom transfer in proteins, an important reaction in chemistry and biology, we obtain a mean absolute error of 3.23 kcal/mol for the range of barriers in the data set using SOAP descriptors and similar values using the marginalized graph kernel. Thus, the two GPR models can robustly estimate reaction barriers within the large chemical and conformational space of proteins. Their predictive power is comparable to a graph neural network-based model, and GPR even outcompetes the latter in the low data regime. We propose GPR as a valuable tool for an approximate but data-efficient model of chemical reactivity in a complex and highly variable environment.
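As a rough illustration of the GPR idea named in the abstract, the following minimal sketch computes the posterior mean and variance of a zero-mean Gaussian process. It is not the code used in the paper: a generic squared-exponential kernel stands in for the SOAP-based and marginalized graph kernels, and the function names, the one-dimensional descriptors and the barrier values are all hypothetical.

```python
import math

def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors (stand-in kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_from_lower(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x

def gpr_predict(X_train, y_train, x_new, noise=1e-6, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    n = len(X_train)
    K = [[rbf(X_train[i], X_train[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_from_lower(L, solve_lower(L, y_train))  # K^-1 y
    k_star = [rbf(x, x_new, length_scale) for x in X_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)  # L^-1 k_star
    variance = rbf(x_new, x_new, length_scale) - sum(vi * vi for vi in v)
    return mean, variance

# Hypothetical 1-D descriptors and barriers (kcal/mol), for illustration only
X = [(0.0,), (1.0,), (2.0,)]
y = [1.0, 3.0, 2.0]
mean_seen, var_seen = gpr_predict(X, y, (1.0,))   # at a training point
mean_far, var_far = gpr_predict(X, y, (10.0,))    # far from all training data
```

At a training point the prediction reproduces the training value with almost no variance, while far from the data the prediction falls back to the prior (mean 0, variance 1). This growing variance is what makes uncertainty estimates such as prediction intervals available from a GPR model.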

[Figure: repo_picture.png]

Top to bottom:
(A) Hydrogen atom transfer (HAT) reaction modelled with the hydrogen atom moving from start to end in equidistant steps.
(B) Environments used to capture the local surroundings with SOAP vectors.
(C) Illustration of a molecular HAT reaction captured as a graph.
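The equidistant steps of panel (A) can be sketched as follows, assuming the intermediate hydrogen positions lie on a straight line between the start and end coordinates. The function name `interpolate_positions`, the coordinates and the choice of 11 points are hypothetical and not taken from the repository.

```python
def interpolate_positions(start, end, n_points):
    """Return n_points positions spaced equally on the line from start to end."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    positions = []
    for i in range(n_points):
        t = i / (n_points - 1)  # fraction of the path covered, from 0.0 to 1.0
        positions.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return positions

# Hypothetical start and end coordinates of the hydrogen atom (in Å)
path = interpolate_positions((0.0, 0.0, 0.0), (2.0, 0.0, 1.0), n_points=11)
```

With 11 points, indices 0, 5 and 10 correspond to the start, the midpoint and the end of the transfer path.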

How to reproduce the results of the paper

Note

This guide assumes a cluster-type environment with SLURM and MPI installed and might not work on a local computer.

  1. Clone the repository.
  2. Download the structures from heiDATA of Heidelberg University and place them, together with metadata.csv, into data/pdb as described in pdb_to_atoms.py. Afterwards, unzip the structures:
    $ unzip dataset_traj.zip && mv dataset_2208_traj traj
    $ unzip dataset_synth.zip && mv dataset_2208_synth synth
  3. Create local.env in the project root folder with the following variables:
    • PROJECTROOT (absolute path to the root of the project)
    • CONDABIN (absolute path to the conda executable, e.g. CONDABIN="/.../miniconda3/bin/conda")
    • OPENMPIMODULE (the OpenMPI module to be loaded in the HPC environment, e.g. "OpenMPI/3.1.4-GCC-8.3.0")
  4. Install the conda environments main_gpr_env, mgk_gpr_env and painn_env by executing install.sh inside the install folder (make executable first with chmod).
  5. Extract the ASE atom structures from the pdb files by running pdb_to_atoms.py (Environment: main_gpr_env).
  6. Run atoms_to_soap.py (Environment: main_gpr_env) with config = dict(i_position=x, [...]) for x=0, 5 and 10 once each and note the corresponding SD-*.npy name.
  7. Insert the SOAP distance files from the previous step into MAIN_RUN.py and SOAP_second_stage.py by editing the config = dict(soaps={"s_0.0": x, "s_5.0": y, "s_10.0": z}, [...]) dictionary, where x, y and z are the SD-*.npy names noted for i_position 0, 5 and 10.
  8. Submit submit_SOAP_GPR.sh with sbatch; it runs all the SOAP GPR calculations.
  9. Train the PaiNN models by submitting submit_data_efficiency.sh with sbatch.
  10. Run analyse_GPR_and_PaiNN_results.py (Environment: main_gpr_env) to collect the results of PaiNN and SOAP.
  11. Train the PaiNN models for the two-stage learning with submit_two_stage_learning.sh.
  12. Collect the PaiNN two-stage learning predictions with submit_collect_painn_predictions.sh.
  13. Train the SOAP model on the difference of the PaiNN predictions with submit_SOAP_second_stage.sh.
  14. Create the MGK covariance matrix with submit_calculate_K.sh.
  15. Find the optimal parameters of the MGK with optimize_kernel_parameters.py (Environment: mgk_gpr_env).
  16. Create the figure comparing the predictions of SOAP GPR, MGK GPR and PaiNN with traj_predictions.py (Environment: main_gpr_env).
  17. Create the data efficiency figure with data_efficiency_SOAP_PaiNN.py (Environment: main_gpr_env).
  18. Create the figure of coverage and negative interval score with plot_interval_score.py (Environment: main_gpr_env).
  19. Create the figure of the two-stage learning MAE of SOAP GPR with PaiNN with plot_two_stage_predictions.py (Environment: main_gpr_env).
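Before submitting any jobs, it can help to check that local.env from step 3 defines all three variables. The sketch below assumes the file uses plain KEY=VALUE lines, as the CONDABIN example suggests; the repository's own scripts may read the file differently. The helper names `parse_env_text` and `missing_keys` and the PROJECTROOT value are hypothetical.

```python
REQUIRED_KEYS = ("PROJECTROOT", "CONDABIN", "OPENMPIMODULE")

def parse_env_text(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def missing_keys(values):
    """Return the required variables that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]

# Example content; the PROJECTROOT path is a hypothetical placeholder
example = '''
PROJECTROOT="/path/to/gpr-hat-barrier-prediction"
CONDABIN="/.../miniconda3/bin/conda"
OPENMPIMODULE="OpenMPI/3.1.4-GCC-8.3.0"
'''
parsed = parse_env_text(example)
problems = missing_keys(parsed)  # empty list when everything is defined
```

To check a real file, pass its content instead of the example string, for instance `parse_env_text(open("local.env").read())`.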

Hardware and Software used:

  • CPU: Intel(R) Xeon(R) Gold 6230
  • GPU: NVIDIA GeForce RTX 2080 SUPER
  • Conda version: 22.11.1
  • OpenMPI: 3.1.4
  • The individual packages used, including version numbers, can be found in the folders of install.

Atomic structures and data

The structures and energies used in this project can be downloaded from heiDATA of Heidelberg University.

Footnotes

  1. Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

  2. Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany

  3. Institute of Nanotechnology, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany

  4. Interdisciplinary Center for Scientific Computing, Heidelberg, Germany
