gmh14 / data_efficient_grammar

[ICLR 2022] Data-Efficient Graph Grammar Learning for Molecular Generation

Home Page: https://openreview.net/forum?id=l4IHywGq6a

License: MIT License

Python 15.93% Shell 0.01% Jupyter Notebook 84.06%

grammar-learning molecule-generation graph-learning efficient-deep-learning graph-neural-networks formal-languages symbolic-representation

data_efficient_grammar's Introduction

Data-Efficient Graph Grammar Learning for Molecular Generation

This repository contains the implementation of the paper Data-Efficient Graph Grammar Learning for Molecular Generation (ICLR 2022 oral).

In this work, we propose a data-efficient generative model (DEG) that can be learned from datasets orders of magnitude smaller than common benchmarks. At the heart of the method is a learnable graph grammar that generates molecules from a sequence of production rules. Our learned graph grammar yields state-of-the-art results in generating high-quality molecules for three monomer datasets that contain only ∼20 samples each.

Installation

Prerequisites

Retro*: The training of DEG relies on Retro* to calculate the metric. Follow the instructions here to install it.
Pretrained GNN: We use this codebase for the pretrained GNN used in our paper. The necessary code and pretrained models are included in this repo.

Conda

You can use conda to install the dependencies for DEG from the provided environment.yml file, which reproduces the exact Python environment we used to run the code for the paper:

git clone git@github.com:gmh14/data_efficient_grammar.git
cd data_efficient_grammar
conda env create -f environment.yml
conda activate DEG
pip install -e retro_star/packages/mlp_retrosyn
pip install -e retro_star/packages/rdchiral

Note: building the necessary wheels with conda may take a considerable amount of time.

Install `Retro*`:

Download and unzip the files from this link, and put all the folders (dataset/, one_step_model/ and saved_models/) under the retro_star directory.
Install dependencies:

conda deactivate
conda env create -f retro_star/environment.yml
conda activate retro_star_env
pip install -e retro_star/packages/mlp_retrosyn
pip install -e retro_star/packages/rdchiral
pip install setproctitle

Train

For Acrylates, Chain Extenders, and Isocyanates,

conda activate DEG
python main.py --training_data=./datasets/**dataset_path**

where **dataset_path** can be acrylates.txt, chain_extenders.txt, or isocyanates.txt.

For the Polymer dataset,

conda activate DEG
python main.py --training_data=./datasets/polymers_117.txt --motif

Since Retro* is a major bottleneck for training speed, we separate it from the main process, run multiple Retro* processes, and use file communication to evaluate the generated grammar during training. This works around the inefficiency of Python's built-in multiprocessing package. Run the following command in another terminal window:

conda activate retro_star_env
bash retro_star_listener.sh **num_processes**
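
The file-based hand-off between the trainer and the Retro* listeners could look roughly like the sketch below. It is not the repository's actual protocol: the directory layout, the `.req`/`.run`/`.res` suffixes, the function names, and the `score_fn` callback are all assumptions made for illustration. The point it shows is that atomic renames let several listener processes share one queue directory without reading half-written files or scoring the same request twice.

```python
import os
from pathlib import Path


def submit(queue_dir: Path, job_id: str, smiles: str) -> None:
    """Trainer side: publish one request (hypothetical file layout)."""
    tmp = queue_dir / f"{job_id}.tmp"
    tmp.write_text(smiles)
    # The rename is atomic, so a listener never sees a partial request.
    os.replace(tmp, queue_dir / f"{job_id}.req")


def serve_once(queue_dir: Path, score_fn) -> int:
    """Listener side: answer pending requests, return how many were handled."""
    handled = 0
    for req in sorted(queue_dir.glob("*.req")):
        claimed = req.with_suffix(".run")
        try:
            os.replace(req, claimed)  # claim the job
        except FileNotFoundError:
            continue  # another listener claimed it first
        result = score_fn(claimed.read_text())
        tmp = claimed.with_suffix(".tmp")
        tmp.write_text(str(result))
        os.replace(tmp, claimed.with_suffix(".res"))
        claimed.unlink()
        handled += 1
    return handled


def collect(queue_dir: Path, job_id: str):
    """Trainer side: return the result text if it is ready, else None."""
    res = queue_dir / f"{job_id}.res"
    return res.read_text() if res.exists() else None
```

In this sketch the trainer would call `submit` for each generated molecule and poll `collect`, while each listener loops over `serve_once` with a function that wraps the Retro* call.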

Note: running multiple Retro* processes is EXTREMELY memory-consuming (~5G each). We suggest starting with a single process (bash retro_star_listener.sh 1), monitoring the memory usage, and then increasing the number accordingly to maximize efficiency. We use 35 in the paper.
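
As a rough starting point for choosing num_processes, the ~5G-per-process figure can be turned into a simple estimate. The helper name and the 20% headroom are assumptions for illustration, not values from the paper; actual usage should still be monitored as described above.

```python
def suggest_num_processes(free_mem_gb: float,
                          per_process_gb: float = 5.0,
                          headroom: float = 0.2) -> int:
    """Estimate how many Retro* listeners fit in the given free memory.

    `headroom` is the fraction of memory left unused (an assumed safety
    margin). Always returns at least 1, even if memory looks too tight.
    """
    usable = free_mem_gb * (1.0 - headroom)
    return max(1, int(usable // per_process_gb))


if __name__ == "__main__":
    print(suggest_num_processes(32))  # 5 listeners with 32G free
```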

After finishing the training, to kill all the generated processes related to Retro*, run

killall retro_star_listener

Use DEG

Download and unzip the log & checkpoint files from this link. See visualization.ipynb for more details.

Acknowledgements

The implementation of DEG is partly based on Molecular Optimization Using Molecular Hypergraph Grammar and Hierarchical Generation of Molecular Graphs using Structural Motifs.

Citation

If you find the idea or code useful for your research, please cite our paper:

@inproceedings{guo2021data,
  title={Data-Efficient Graph Grammar Learning for Molecular Generation},
  author={Guo, Minghao and Thost, Veronika and Li, Beichen and Das, Payel and Chen, Jie and Matusik, Wojciech},
  booktitle={International Conference on Learning Representations},
  year={2021}
}

Contact

Please contact [email protected] if you have any questions. Enjoy!

data_efficient_grammar's Issues

ZeroDivisionError: division by zero

Hello, thank you for your wonderful project. I ran into a problem when I tried to run the code on a polymer dataset with the --motif option. Is this caused by my SMILES strings or by something else, and how should I solve it?

Graph contraction status: True
Start grammar evaluation...
Generating sample 0/100
Traceback (most recent call last):
  File "main.py", line 211, in <module>
    learn(mol_sml, args)
  File "main.py", line 130, in learn
    eval_metric = evaluate(l_grammar, args, metrics=['diversity', 'syn'])
  File "main.py", line 33, in evaluate
    mol, iter_num = random_produce(grammar)
  File "D:\D_E_G\grammar_generation.py", line 162, in random_produce
    _, idx = sample(starting_rules)
  File "D:\D_E_G\grammar_generation.py", line 139, in sample
    prob = [1/len(l)] * len(l)
ZeroDivisionError: division by zero
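
The traceback shows `sample` dividing by `len(l)` while `starting_rules` is empty, which suggests the learned grammar contained no rule to start generation from. A defensive version of uniform sampling might look like the sketch below. The function name is hypothetical, and the `(item, index)` return shape is inferred only from the call `_, idx = sample(starting_rules)`; the repository's own `sample` may differ. The guard only makes the failure easier to read and does not explain why no starting rule was learned.

```python
import random


def sample_uniform(items, rng=random):
    """Pick one element uniformly; return (item, index).

    Raises a descriptive error instead of ZeroDivisionError when the
    candidate list is empty (e.g. a grammar without starting rules).
    """
    if len(items) == 0:
        raise ValueError(
            "no candidates to sample from: the grammar has no starting rule"
        )
    idx = rng.randrange(len(items))
    return items[idx], idx
```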

Having problems with environment.yml

Hi! Thanks for the code. I'm having issues with the environment:

error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /home/********/anaconda3/envs/DEG2/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-1szh297k/torch-sparse_dfcc41865d254db5ace139557449ac53/setup.py'"'"'; __file__='"'"'/tmp/pip-install-1szh297k/torch-sparse_dfcc41865d254db5ace139557449ac53/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-488ahuy4/install-record.txt --single-version-externally-managed --compile --install-headers /home/kiwoong/anaconda3/envs/DEG2/include/python3.6m/torch-sparse Check the logs for full command output.
/
failed

CondaEnvException: Pip failed

PyTorch Geometric doesn't seem to work.

Inability to process SMILES with certain SMILES characters

Hello,
My team and I are interested in using your package to generate new molecules that may exhibit a certain type of toxicity. To do that, we explored whether your package can use a user-defined scoring function as the center of the training protocol. A new dataset is also required to complete the training process. Although your package proved able to train with user-defined functions, it seems to have trouble handling SMILES representations that contain certain characters. After some investigation, it appears that the package fails to process SMILES containing the / and \ characters, which indicate the cis and trans positions of atoms. We were wondering whether there is an easy fix for this problem and, if so, what should be done.

This is the error it keeps showing:
data processing 0/44
Traceback (most recent call last):
  File "main.py", line 235, in <module>
    learn(mol_sml, args)
  File "main.py", line 121, in learn
    subgraph_set_init, input_graphs_dict_init = data_processing(smiles_list, args.GNN_model_path, args.motif)
  File "/home/qspt_user/data_efficient_grammar/grammar_generation.py", line 42, in data_processing
    subgraphs.append(SubGraph(subgraph_i_mapped, mapping_to_input_mol=subgraph_i_mapped, subfrags=list(cluster)))
  File "/home/qspt_user/data_efficient_grammar/private/molecule_graph.py", line 91, in __init__
    super(SubGraph, self).__init__(mol, is_subgraph=True, mapping_to_input_mol=mapping_to_input_mol)
  File "/home/qspt_user/data_efficient_grammar/private/molecule_graph.py", line 15, in __init__
    self.hypergraph = mol_to_hg(mol, kekulize=True, add_Hs=False)
  File "/home/qspt_user/data_efficient_grammar/private/hypergraph.py", line 744, in mol_to_hg
    bipartite_g = mol_to_bipartite(mol, kekulize)
  File "/home/qspt_user/data_efficient_grammar/private/hypergraph.py", line 692, in mol_to_bipartite
    mol = standardize_stereo(mol)
  File "/home/qspt_user/data_efficient_grammar/private/hypergraph.py", line 938, in standardize_stereo
    atom_idx_1 = each_bond.GetStereoAtoms()[0]
IndexError: Index out of range
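
One possible workaround, if cis/trans information is not needed for the task, is to drop the directional bond markers from the input file before training. The sketch below is an assumption-laden preprocessing step, not a fix to `standardize_stereo`, and it has not been verified against this repository; the helper names are hypothetical. In SMILES, `/` and `\` only annotate the direction of a bond, so removing them leaves a valid string that describes the same connectivity without double-bond stereochemistry.

```python
def has_cis_trans_markers(smiles: str) -> bool:
    """True if the SMILES string uses directional bond markers."""
    return "/" in smiles or "\\" in smiles


def strip_cis_trans_markers(smiles: str) -> str:
    """Remove '/' and '\\' markers; cis/trans information is lost."""
    return smiles.replace("/", "").replace("\\", "")


if __name__ == "__main__":
    for s in ["F/C=C/F", "F/C=C\\F", "CCO"]:
        print(s, "->", strip_cis_trans_markers(s))
```

`has_cis_trans_markers` can also be used on its own to list which of the training molecules would be affected, so they can be inspected before any information is discarded.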

Missing instructions from installation

apt install -y libxrender-dev
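
The issue gives only the command. A quick way to check whether the Xrender shared library is already visible on a Linux system is sketched below; the helper name is hypothetical, and a result of `False` only means the lookup found nothing, not that the apt package above is necessarily the right remedy for a given setup.

```python
from ctypes.util import find_library


def xrender_available() -> bool:
    """Return True if a shared library named 'Xrender' can be located."""
    return find_library("Xrender") is not None


if __name__ == "__main__":
    print("libXrender found:", xrender_available())
```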